Neural Networks for Classification


by 


William Christopher Pritchett, M.S.E. 
The University of Texas at Austin, 1998 


SUPERVISOR: Irwin W. Sandberg 


In many applications, ranging from character recognition to signal detection
to automatic target identification, the problem of signal classification is
of interest. Often, for example, a signal is known to belong to one of a family of
sets $C_1, \ldots, C_4$, and the goal is to classify the signal according to the set
to which it belongs. The main purpose of this thesis is to show that under certain
conditions placed on the sets, the theory of uniform approximation can be applied to
solve this problem. Specifically, if we assume that the sets $C_i$ are compact subsets
of a normed linear space, several approaches using the Stone-Weierstrass theorem
give us a specific structure for classification. This structure is a single hidden
layer feedforward neural network. We then discuss the functions which comprise
the elements of this neural network and give an example of an application.
1. Signal Classification


Signal classification is, quite simply, the process of examining a signal 
and determining a class, or group, from which it came. Humans perform many 
instances of signal classification each day, often without even knowing it. For 
example, one might read a signature (the signal) carefully to determine the 
author (the class). This might be a process that would be extremely hard for a 
computer to perform. 

There are numerous military, civilian, and academic problems whose solution
relies on signal classification. It would be fruitless to attempt to compile an
exhaustive list of applications, so we will state and develop a few problems here
in which the theory of signal classification plays an important role in the
solution.


Automatic Target Recognition 


The field of automatic target recognition is extremely important, primarily
in military applications. The main purpose of automatic target recognition
is the use of computer processing to detect and recognize signatures in sensor
data [1]. The targets are most often in a cluttered environment and frequently
in hostile territory. They may include such things as aircraft, missiles, tanks,
or warships. The clutter in their background may come from temperature or
pressure disturbances, atmospheric variations, topographical objects, or even
other targets.


There are typically two steps to an automatic target recognition problem:
detection and identification. Usually some relatively fast and coarse method is
used to detect an object against background noise, and a slower, more precise
method is used to identify it. Typical features that must be extracted
from the target when it is detected include its position, its size and shape,
and its speed.

In order to measure these quantities, an automatic target recognition 
system will possess sensors such as high resolution cameras and complex radar 
arrays. These sensors will obtain data and send it to the processing portion of 
the system. The system will then determine first whether a target even exists 
and then attempt to identify the target. 

It is immediately very clear that the second portion of the problem (the 
identification) is basically a pure classification problem. Once it is determined 
that a tank is found, for example, it is important to be able to quickly determine 
whether the tank is friendly or hostile. An automatic recognition system thus 
frequently consists of several modules, one of which is the classifier. 

Usually the classifier is designed with the assumption that each input,
once found, belongs to only one of the classes. This assumption will become
important later because it will allow us to make use of some well-known
mathematical theorems in order to determine when classification may be possible.


Pattern Recognition 


A second application of the theory of signal classification is in the field of
pattern recognition. This is an extremely broad field, concerning a wide range
of problems of practical interest, including character recognition and speech
identification.

One classical application is the reading of characters written either by
hand or by machine. This application has a wide range of uses in government
and commercial industry. For example, computers used by the post office are
able to identify machine-written letters on envelopes in order to sort them.
Another important area deals with financial institutions. In these cases, the
problem typically consists of classifying an input character into one of the
thirty-six classes formed by the twenty-six letters of the alphabet and the ten
numerals. The area of printing is usually prescribed, so it is easy to locate and
segment the characters. Some form of sampling is usually done, and then an
algorithm determines the character.

There are also several problems in the field of speech recognition that rely 
heavily on classification theory. These problems include the following: speaker 
identification, speaker verification, and isolated word recognition [16]. In a 
speaker verification system, the number of classes relates to the number of 
different individuals that one wishes to recognize. In isolated word recognition, 
the number of classes will depend on the "vocabulary" of the system and may 
be as large as 10,000. 

Many problems dealing with pattern recognition are found in the area
of medicine as well. There are many applications that result in continuous
functions, two-dimensional gray scale images, and time-varying images. These
include results from electrocardiograms, electroencephalograms, and X-ray
images, to name a few. Cell analyzers classify blood cells in a population and
determine cell type. Signal classification routines are of enormous importance
in quickly extracting information from these and other biological data.

These are just some of the many real-world applications in which signal
classification plays a very important role. This makes it necessary to develop
routines which are capable of performing well in signal processing problems. It
is in this light that we consider the problem of determining a structure suitable
for classification.








2. Neural Networks 


It has long been recognized that the human brain functions in a completely
different way from the modern digital computer. There has been great
interest in studying how the human brain works and in determining whether it
is feasible to design a model capable of solving problems in a similar manner.
Ramón y Cajal in 1911 introduced the concept of neurons as the basic elements
of the brain [11]. It has been determined that neurons process information
one hundred thousand to one million times slower than a basic silicon gate chip.
The brain compensates for this slower speed by possessing in the neighborhood
of 10 billion neurons and 60 billion synapses, or interconnections between the
neurons [21]. As a result the brain is capable of performing many tasks at rates
much greater than even the fastest computer. It is in an attempt to emulate
this capability of the brain that the field of neural networks, or artificial neural
networks, was born.

The history of neural networks dates back to the 1940's, when McCulloch 
and Pitts in 1943 proposed a computational model of an element resembling a 
neuron [3]. After some initial research, the idea faded until interest began to 
return in the 1980's. Since then, the field of neural networks has grown rapidly, 
with interest from researchers in a number of fields ranging from engineering to 
physics to psychology. 

A neural network, essentially, is a structure that attempts to model the
way the brain performs some task and then to perform that task in a similar
manner. The structure may be built electronically or simulated in software, for
example. A neural network will contain a large number of individual cells, which
model the neurons, and a number of interconnections between them, which
model the synapses. Often the information passed through the interconnections
will be multiplied by constants in order to achieve a certain task. This is known
as weighting. Haykin gives a definition as adapted from Aleksander and Morton
in 1990:


A neural network is a massively parallel distributed processor that
has a natural propensity for storing experiential knowledge and
making it available for use. It resembles the brain in two respects:

1. Knowledge is acquired by the network through a learning process.

2. Interneuron connection strengths known as synaptic weights
are used to store the knowledge.


The learning process mentioned here is often an attempt to modify the 
interconnection weights in order to accomplish the designated task. This at- 
tempt compares with the well-known field of adaptive filter theory, where filter 
weights are adapted over time until they approach a steady-state value. 
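
The comparison with adaptive filtering can be made concrete with a minimal sketch of the least-mean-squares (LMS) update, a standard adaptive filter rule. The step size `mu`, the target weights `true_w`, and the synthetic noise-free data are assumptions chosen only for illustration; they do not come from the text.

```python
import random

def lms_train(samples, mu=0.1, n_weights=2):
    """Adapt weights with the LMS rule: w <- w + mu * error * input."""
    w = [0.0] * n_weights
    for x, d in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))        # current output
        e = d - y                                        # error against desired output
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]   # weight correction
    return w

rng = random.Random(0)          # fixed seed so the run is repeatable
true_w = [0.5, -1.5]            # hypothetical weights that generate the desired output
samples = []
for _ in range(2000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    d = sum(wi * xi for wi, xi in zip(true_w, x))
    samples.append((x, d))

w = lms_train(samples)
print(w)  # the adapted weights settle near true_w
```

With noise-free data and a small step size the weights approach a steady-state value, which is the behavior the paragraph above describes.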

There are many benefits that arise from neural networks' inherent structure.
The following are some of them (see [11]).

1. Nonlinearity. The functions performed by the neurons are nonlinear;
therefore the entire network, which is a weighted connection of these neurons,
will also be nonlinear. This helps in modeling typical applications,
which are often nonlinear.


2. Input-output Mapping. One way in which the values for the weights used
in the interconnections of the neural network are obtained is by a process
called training. An example input is given, and weights are chosen so
that the error between the actual output and some known desired output
is minimized. This training procedure is repeated until the values of the
weights reach a steady state (if possible). Thus the neural network learns
by creating an input-output mapping.


3. Adaptivity. A neural network has the property of adapting its synaptic
weights in order to match a change in the surrounding environment. When
it is operating in one environment, it may be retrained to operate in
another environment which has only minimal changes. Further, a neural
network operating in a nonstationary environment is able to adapt its
weights in real time.


4. Evidential Response. A neural network, when faced with a choice, is often
able not only to select the right choice, but to give a confidence about the
choice it made. For example, a neural network used for classification and
given an input signal may output the class for that signal as well as how
sure it is that that is actually the correct class.


5. Fault Tolerance. Since each of the many neurons in a neural network
stores an important bit of information, the network's power is distributed
over each of these neurons. This allows the network in theory to continue
operating even when one of the neurons fails, though with some degradation
in performance. Neural networks are thus often marked by a gradual
decay in performance instead of a single catastrophic failure.


6. Uniformity of Analysis and Design. Because all neural networks are similar
in a structural sense and the same notation is used in the applications
of neural networks to different problems, they are in a sense universal.
This is seen in the following properties:

• Neurons are common to all neural networks.

• This commonality allows for the sharing of information between neural
networks in different applications.

• It is possible to build modular networks simply by integrating
the different modules. In other words, parts of different networks
(or even entire networks) may be used easily in conjunction with one
another to create a new network.


As neurons are the building blocks of a neural network, their modeling is
most important. The basic design for a neuron is fairly simple. A set of synapses
provides the inputs to the neuron. These interconnections are weighted by real
numbers, the synaptic weights. The weighted values are then summed. Finally, this
sum is passed through a (typically) nonlinear activation function. This function
usually serves to limit the output of the neuron to some desired range, for
example [0,1] or [-1,1]. An example of this model of a neuron is shown in
Figure 1.

Figure 1: Nonlinear model of a neuron
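
The neuron model just described (weight the inputs, sum them, apply an activation function) can be sketched as follows. The logistic activation and the optional `bias` term are assumptions made for illustration; the text only requires that the activation limit the output to a range such as [0,1].

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a logistic activation.

    The bias term is an illustrative assumption; set it to 0.0 to match
    the plain weighted-sum model described in the text.
    """
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-s))   # output lies strictly between 0 and 1

print(neuron([0.0, 0.0], [1.0, 1.0]))   # zero net input gives 0.5
print(neuron([2.0, -1.0], [0.5, 0.25]))
```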

While the neurons themselves are modeled more or less the same regardless
of the application, there are different architectures for the actual network.
We will be concerned with just one particular type, called a feed-forward
network with one hidden layer. This network architecture consists of a large number
of neurons arranged schematically in three layers. This may be seen in Figure 2.

In theory, each unit of the input layer may be connected to each unit of
the hidden layer. Each connection has an associated weight, which as mentioned
above is a real number. The weights are denoted by $w_{ij}$. Each unit
of the hidden layer thus receives a weighted sum of elements from the input layer
and then processes this sum with an activation function. Finally, the results of
these activations are transmitted to the output layer with another set of weights
and then summed. The result for the network structure shown in Figure 2 is:

Figure 2: A feed-forward neural network
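
A minimal sketch of the forward computation described in the preceding paragraph, under stated assumptions: `W[i][j]` plays the role of the weight $w_{ij}$ from input unit i to hidden unit j, the output-layer weights `c` and the `tanh` activation are hypothetical choices, and a single output unit is assumed.

```python
import math

def forward(x, W, c, activation=math.tanh):
    """One-hidden-layer feed-forward pass with a single summing output unit.

    x : list of input values
    W : W[i][j] is the weight from input unit i to hidden unit j
    c : c[j] is the (hypothetical) weight from hidden unit j to the output
    """
    hidden = [
        activation(sum(W[i][j] * x[i] for i in range(len(x))))
        for j in range(len(c))
    ]
    return sum(cj * hj for cj, hj in zip(c, hidden))

x = [1.0, 2.0]
W = [[0.0, 1.0],
     [0.0, 0.0]]
c = [3.0, 2.0]
print(forward(x, W, c))   # equals 2 * tanh(1) for these weights
```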


Finally, it is important to note that it is not necessarily possible to
solve any problem simply by constructing a neural network at random and then
attempting to train the weights. It is important to determine when a solution
will be possible and what structure of network to try. Later it will be shown that
a certain type of neural network is capable of solving an important classification
problem.
3. Background


Metric Spaces 


A type of space that will play a particularly important role in the study
of approximation is the metric space. Metric spaces are described in detail in
many books, for example [9], [13], and [18].


Definition: A metric space is a pair $(X, \rho)$ where $X$ is a set of elements and
$\rho$ is a metric, or distance function, that is nonnegative and real-valued with the
following properties:
1. $\rho(x, y) = 0$ if and only if $x = y$;
2. $\rho(x, y) = \rho(y, x)$;
3. $\rho(x, y) + \rho(y, z) \ge \rho(x, z)$.


Some examples of metric spaces are: 


Example 1: The set of real numbers with metric $\rho(x, y) = |x - y|$, referred to
as $\mathbb{R}$ or $\mathbb{R}^1$.

Example 2: The set of all ordered n-tuples $x = (x_1, x_2, \ldots, x_n)$, with metric
$\rho(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$. This space is generally referred to
as $\mathbb{R}^n$.


Example 3: The set of continuous functions defined on a closed interval $[a, b]$
with metric $\rho(f, g) = \max_{a \le t \le b} |f(t) - g(t)|$.

Example 4: This same set of continuous functions along with the metric

$\rho(f, g) = \left( \int_a^b [f(t) - g(t)]^2 \, dt \right)^{1/2}$

forms a different yet equally valid metric space (known as $L^2$). Thus, the
metric as well as the set of points must be known in order for the space to be
completely determined.
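
To illustrate that the same set of functions yields different distances under the two metrics of Examples 3 and 4, the sketch below approximates both for $f(t) = t$ and $g(t) = t^2$ on $[0, 1]$. The grid size and the midpoint rule are assumptions of this illustration; the exact values are $1/4$ for the maximum metric and $\sqrt{1/30}$ for the integral metric.

```python
import math

def sup_metric(f, g, a, b, n=10000):
    """Approximate max |f(t) - g(t)| over [a, b] on a uniform grid."""
    return max(abs(f(a + (b - a) * k / n) - g(a + (b - a) * k / n))
               for k in range(n + 1))

def l2_metric(f, g, a, b, n=10000):
    """Approximate (integral of (f - g)^2)^(1/2) with the midpoint rule."""
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        t = a + (k + 0.5) * h
        total += (f(t) - g(t)) ** 2 * h
    return math.sqrt(total)

f = lambda t: t
g = lambda t: t * t
d_sup = sup_metric(f, g, 0.0, 1.0)
d_l2 = l2_metric(f, g, 0.0, 1.0)
print(d_sup, d_l2)   # two different distances between the same pair of functions
```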


Let $X$ be a metric space with $x_0 \in X$ and let $r > 0$. We define an open
ball with radius $r$ centered about $x_0$ (written $b(x_0, r)$) to be the set of points
$x \in X$ such that $\rho(x, x_0) < r$. Let $A \subset X$. We define a point $x \in A$ to be an
interior point of the set $A$ if $b(x, r) \subset A$ for some $r > 0$. That is, we can find
an open ball surrounding the point $x$ such that every point in the ball belongs
to the set $A$. It is in this way that we go about defining open sets in a metric
space. In fact, a set $A \subset X$ is called an open set if all of its points are interior
points.


Example 1: Consider the set $(0, 1)$ in $\mathbb{R}$. Given any point in the set, it is
possible to choose an open ball of some radius such that the ball is contained
in $(0, 1)$. Therefore, $(0, 1)$ is open in $\mathbb{R}$.

Example 2: On the other hand, consider the set $[0, 1)$ in $\mathbb{R}$ and look at any
open ball about the point 0 with radius $r$. Whatever the choice of $r$, there will
be points contained in the ball that are not in $[0, 1)$ (for example, the point
$-r/2$); therefore the point 0 is not an interior point of the set $[0, 1)$, and
the set is not open.


Let $X$ be a metric space and $x \in X$. We define a neighborhood of $x$ as a
set containing an open set containing $x$. This open set will necessarily contain
an open ball $b(x, \varepsilon)$ for some $\varepsilon > 0$. Therefore, every neighborhood
of a point will contain an open ball about that point. Again let $X$ be a metric space
and let $A \subset X$. A point $x \in X$ is called a contact point of $A$ if every
neighborhood of $x$ contains at least one point in $A$. Obviously all $x \in A$ are
contact points of $A$. If every neighborhood of $x$ contains infinitely many points in
$A$, then $x$ is called a limit point of $A$. Note that a limit point is necessarily a
contact point by definition. The closure of a set $A$, written as $\bar{A}$, is simply the
set of all the contact points of $A$. A set which is equal to its closure ($A = \bar{A}$)
is known as a closed set.


Example 1: Consider again the set $[0, 1)$ in $\mathbb{R}$. It is not possible to find an
open ball about the point 1 that does not contain any points in $[0, 1)$. Therefore
every neighborhood of 1 contains at least one point (in fact, every neighborhood
contains infinitely many points) in the set $[0, 1)$. This implies that 1 is a contact
point (and a limit point) of the set $[0, 1)$. Since $1 \notin [0, 1)$, the set does not
coincide with its closure (in fact, as expected, $\overline{[0, 1)} = [0, 1]$) and is
therefore not a closed set.

Example 2: On the other hand, the set $[0, 1]$ can be shown to be closed as its
closure is the very same set $[0, 1]$.


One of the most important concepts concerning metric spaces is that of
continuity. Let $(X, \rho_x)$ and $(Y, \rho_y)$ be metric spaces and let $f$ be a function
such that $f : X \to Y$. Then $f$ is continuous at the point $p \in X$ if for every
$\varepsilon > 0$ there exists a $\delta > 0$ such that $\rho_y(f(x), f(p)) < \varepsilon$
whenever $\rho_x(x, p) < \delta$.

A sequence $\{x_n\}$ in a metric space $X$ is said to converge if there is a point
$p \in X$ with the following property: for every $\varepsilon > 0$ there is an integer $N$
such that $n \ge N$ implies that $\rho(x_n, p) < \varepsilon$. We write this as $x_n \to p$ or
$\lim_{n \to \infty} x_n = p$. We define $\{x_n\}$ to be a Cauchy sequence in a metric space
$X$ if for every $\varepsilon > 0$ there exists a positive integer $N$ such that
$\rho(x_n, x_m) < \varepsilon$ for $n, m \ge N$. We can easily show that every convergent
sequence is a Cauchy sequence; the converse, however, does not hold in every metric space.
A metric space is said to be complete if every Cauchy sequence converges to a
point in the space. The completeness of certain metric spaces is very important
to proving results in those spaces.
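
As a hedged illustration of why completeness matters, the sketch below builds a Cauchy sequence of rational numbers whose limit, $\sqrt{2}$, is not rational. The Newton-type iteration $x_{n+1} = (x_n + 2/x_n)/2$ and the starting value are choices made for this example only; the point is that the rationals, with the usual metric, are not complete.

```python
from fractions import Fraction

def sqrt2_sequence(n_terms):
    """Rational sequence x_{n+1} = (x_n + 2 / x_n) / 2 starting from 1."""
    terms = [Fraction(1)]
    for _ in range(n_terms - 1):
        x = terms[-1]
        terms.append((x + 2 / x) / 2)
    return terms

seq = sqrt2_sequence(7)
# Distances between successive terms shrink rapidly (Cauchy behavior) ...
gaps = [abs(seq[k + 1] - seq[k]) for k in range(len(seq) - 1)]
# ... yet no term can square to exactly 2, since every term is rational.
print([float(g) for g in gaps])
print(float(seq[-1]), seq[-1] ** 2 == 2)
```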

In a similar manner, we say that a sequence of functions $\{f_n\}$ from $X$ to
$\mathbb{R}$ converges uniformly on $X$ to a function $f$ if for every $\varepsilon > 0$ there
exists an integer $N$ such that $n \ge N$ implies $|f_n(x) - f(x)| < \varepsilon$ for all
$x \in X$. We often write this as $f_n \to f$ uniformly. For a discussion in greater depth
of convergence, see [19].
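
The difference between uniform and merely pointwise convergence can be checked numerically. In this sketch, which is an illustration under assumed choices and not part of the original development, $f_n(x) = x/n$ converges uniformly to 0 on $[0, 1)$, while $g_n(x) = x^n$ converges to 0 at each point of $[0, 1)$ but not uniformly. The grid maximum is only a lower bound on the true supremum.

```python
def sup_error(f_n, f, grid):
    """Largest |f_n(x) - f(x)| over the sample points (a lower bound on the sup)."""
    return max(abs(f_n(x) - f(x)) for x in grid)

grid = [k / 10000 for k in range(10000)]   # sample points in [0, 1)
zero = lambda x: 0.0
orders = (1, 10, 50)

uniform = [sup_error(lambda x, n=n: x / n, zero, grid) for n in orders]
pointwise_only = [sup_error(lambda x, n=n: x ** n, zero, grid) for n in orders]

print(uniform)          # shrinks roughly like 1/n
print(pointwise_only)   # stays close to 1 for every n
```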


Topological Spaces 


Although metric spaces are usually the most general spaces needed, there
may be times when a result can be proved for a more general space. It is for
this purpose that we now introduce the topological space.


Definition: A topological space is a pair $(X, \tau)$ consisting of a set of points
$X$ and a topology $\tau$, where $\tau$ is a family of subsets $G \subset X$, called open sets,
with the following properties:

1. The set $X$ itself and the empty set $\emptyset$ belong to $\tau$.

2. Arbitrary unions $\bigcup_{\alpha} G_\alpha$ and finite intersections
$\bigcap_{k=1}^{n} G_k$ of open sets belong to $\tau$.


The definitions of open and closed sets in a topological space $X$ are quite
simple. A set $A \subset X$ is an open set if $A$ belongs to $\tau$. A set $B$ in a topological
space $X$ is a closed set if its complement $X - B$ is open.

We can also extend the concepts of a neighborhood, contact point, limit
point, and closure of a set to a topological space. By a neighborhood of $x$, we
mean any open set $G$ containing $x$. A point $x \in X$ is a contact point of $T \subset X$
if every neighborhood of $x$ contains at least one point in $T$. A point $x \in X$ is a
limit point of $T \subset X$ if every neighborhood of $x$ contains infinitely many points
in $T$. Finally, the closure of a subset $T$ of a topological space $X$ is the set of all
the contact points of $T$.

Two important types of topological spaces are Hausdorff spaces and normal
spaces. A topological space $X$ is called a Hausdorff space if:

1. Sets consisting of single points are closed.

2. For every pair of distinct points $x$ and $y$ in $X$, there are disjoint
neighborhoods of $x$ and $y$.

A topological space is called a normal space if:

1. Sets consisting of single points are closed.

2. For every pair of disjoint closed sets $A$ and $B$, there are disjoint
neighborhoods of $A$ and $B$.


Obviously, every normal space is Hausdorff, though a Hausdorff space need
not be normal. It can be verified that all metric spaces are topological spaces
simply by taking $\tau$ to be the family of sets that are open in the metric
space in the usual sense. This is very important as it allows any result relating
to topological spaces to be applied to metric spaces as well. In fact, we get an
even better result: all metric spaces are normal (and therefore Hausdorff). The
converses of both of these statements, however, are not true.


Example: The topological space consisting of only two points $\{0, 1\}$, where $\tau$
consists only of the sets $\{0, 1\}$ (the entire space) and $\emptyset$, is not a metric space.


Continuity in a topological space is a somewhat different concept from
continuity in a metric space as well. Let $(X, \tau_x)$ and $(Y, \tau_y)$ be two topological
spaces and let $f : X \to Y$. Then $f$ is continuous if $f^{-1}(A) \in \tau_x$ for every $A$ in
$\tau_y$. In other words, continuity means that the inverse image of an open set is
open.

A family $M$ of subsets $M_\alpha$ of a topological space $X$ is called a cover of
$X$ if $X \subset \bigcup_\alpha M_\alpha$. If the family consists entirely of open sets, then
we call the family an open cover. A topological space is compact if every open cover
has a finite subcover.

Although metric spaces possess many of the nice properties that we
would like to have for topological spaces, it is not true that all metric spaces
are compact. There are some theorems (see for example [14]), however, that
allow us to determine whether a given metric space is compact without having
to view it as a topological space.

Let $A$ and $B$ be subsets of the metric space $X$, and let $\varepsilon > 0$. Then the
set $A$ is called an $\varepsilon$-net for the set $B$ if for any $x \in B$ there exists a
point $x_a \in A$ such that $\rho(x, x_a) < \varepsilon$.


Theorem 1 (Hausdorff). For compactness of a set $M$ of a metric space $X$ it
is necessary that there should exist a finite $\varepsilon$-net of the set $M$ for every
$\varepsilon > 0$. If the space $X$ is complete, then the condition is also sufficient.


Roughly speaking, a set is compact if, for any prescribed radius, we can find a
finite number of points such that the open balls of that radius centered at those
points together contain the set. There are some improvements to this if we
consider certain specific spaces.
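
As an illustration of the finite $\varepsilon$-net condition, the following sketch constructs a finite net for the interval $[0, 1]$ and checks it against sample points. The equally spaced construction and the helper names `epsilon_net` and `is_covered` are assumptions introduced for this example.

```python
import math

def epsilon_net(eps):
    """Finite eps-net for [0, 1]: equally spaced points with spacing at most eps."""
    m = math.ceil(1.0 / eps)
    return [k / m for k in range(m + 1)]

def is_covered(points, net, eps):
    """True if every sample point lies within distance eps of some net point."""
    return all(min(abs(x - a) for a in net) < eps for x in points)

sample = [k / 1000 for k in range(1001)]   # sample points of [0, 1]
net = epsilon_net(0.1)
print(len(net), is_covered(sample, net, 0.1))
print(is_covered(sample, [0.0, 1.0], 0.1))  # two points alone are not a 0.1-net
```

Because the spacing is at most `eps`, every point of the interval is within `eps / 2` of a net point, so the net works for the whole interval and not only for the sampled points.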


Example 1: (Heine-Borel). A subset of $\mathbb{R}$ is compact if and only if it is closed
and bounded.


Example 2: (Arzela). The functions of a set $A$ are said to be uniformly
bounded if there exists a constant $K$ such that $|x(t)| < K$ for all $x(t) \in A$.
The same functions are equicontinuous if given $\varepsilon > 0$, there exists a
$\delta > 0$ such that $|x(t_1) - x(t_2)| < \varepsilon$ for all $x \in A$ whenever
$|t_1 - t_2| < \delta$. A set $A \subset C[0, 1]$, the space of real-valued continuous
functions on the closed interval $[0, 1]$, is compact if and only if $A$ is closed and
the functions $x \in A$ are uniformly bounded and equicontinuous.


Linear Spaces 
We now introduce the concept of a linear space. 
Definition: A nonempty set $L$ is called a linear space if it satisfies the following
axioms:


1. Any two elements $x \in L$, $y \in L$ uniquely determine a third element
$x + y \in L$, called the sum of $x$ and $y$, that satisfies the following properties:

(a) $x + y = y + x$ (commutativity);

(b) $(x + y) + z = x + (y + z)$ (associativity);

(c) $L$ contains an element 0, called the zero element, such that
$x + 0 = x$ for all $x \in L$;

(d) For each $x \in L$, there exists an element $-x \in L$ such that
$x + (-x) = 0$, where 0 is the zero element;

2. There exists a product operation such that any element $x \in L$ and any
number $\alpha$ determine a unique element $\alpha x \in L$ such that:

(a) $\alpha(\beta x) = (\alpha\beta)x$;

(b) $1x = x$;


3. The operations of addition and multiplication obey the following
distributive axioms:

(a) $(\alpha + \beta)x = \alpha x + \beta x$;

(b) $\alpha(x + y) = \alpha x + \alpha y$.


The elements $x, y, \ldots$ of a linear space are often called vectors, and the
entire space is often called a vector space. The numbers $\alpha, \beta, \ldots$ are referred
to as scalars and the entire set of allowable scalars is referred to as the field.
Typically, the field is the set of real numbers, in which case the space is referred
to as a real linear space. A subset $L_0$ of a linear space $L$ is referred to as a linear
subspace of $L$ if $L_0$ itself is a linear space over the same field as $L$.

A linear space need not possess any topology whatsoever as long
as it satisfies the three properties above. However, in many applications the
concepts of a linear space and a topological space are combined. A space that
is both a linear space and a topological space is referred to either as a linear
topological space or a topological vector space. We require additionally only
that the vector operations of addition and multiplication (which are not always
the usual addition and multiplication) be continuous in the topology $\tau$. It is
also possible to apply the concept of a metric to a linear space, but what is more
useful is to define an operation a bit more specific than a metric, called a norm,
and apply it to a linear space.


Normed Linear Spaces 


Definition: A linear space $L$ equipped with an operation called a norm ($\|\cdot\|$)
is called a normed linear space if $\|\cdot\|$ satisfies the following three properties:

1. $\|x\| \ge 0$ for all $x$, where $\|x\| = 0$ if and only if $x = 0$;

2. $\|\alpha x\| = |\alpha| \, \|x\|$ for all $x \in L$ and all $\alpha$;

3. $\|x + y\| \le \|x\| + \|y\|$ for all $x$ and $y$ in $L$.


Just as every metric space is also a topological space, every normed linear
space may also be considered a metric space (and therefore a topological space
as well) by taking the metric to be:

$\rho(x, y) = \|x - y\|$.

Again, the converse is not true.


Example: The metric space consisting of the closed interval $[0, 1]$ with the
"discrete metric" $\rho(x, y) = 1$ if $x \ne y$ and $\rho(x, x) = 0$ cannot be made into a
normed linear space.


A normed linear space that is complete (in the same sense that a metric
space is complete) is known as a Banach space.

One special type of Banach space is called a Hilbert space.

Definition: A Hilbert space is a Banach space with the norm
$\|x\| = \sqrt{\langle x, x \rangle}$, where $\langle \cdot, \cdot \rangle$ is an inner product
with the following properties (assuming the space is real):

• $\langle x, y \rangle = \langle y, x \rangle$;

• $\langle \alpha x + \beta y, z \rangle = \alpha \langle x, z \rangle + \beta \langle y, z \rangle$;

• $\langle x, x \rangle > 0$ for $x \ne 0$.


The most common example of a Hilbert space is the n-dimensional space
$\mathbb{R}^n$, with the Euclidean norm $\|x\| = \sqrt{\sum_{k=1}^{n} x_k^2}$ where
$x = (x_1, x_2, \ldots, x_n)$.
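
The Euclidean norm and the metric it induces, $\rho(x, y) = \|x - y\|$, can be sketched directly. The sample vectors below are arbitrary choices used only to spot-check the norm properties and the triangle inequality; a numerical check on a few points is an illustration, not a proof.

```python
import math

def norm(x):
    """Euclidean norm on R^n."""
    return math.sqrt(sum(xi * xi for xi in x))

def metric(x, y):
    """Metric induced by the norm: rho(x, y) = ||x - y||."""
    return norm([xi - yi for xi, yi in zip(x, y)])

x, y, z = [1.0, 1.0], [4.0, 5.0], [0.0, -2.0]
print(norm([3.0, 4.0]))     # 5.0
print(metric(x, y))         # 5.0
# Triangle inequality for the induced metric on these sample points.
print(metric(x, z) <= metric(x, y) + metric(y, z))
```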


The Hahn-Banach Theorem and Separation in Linear Spaces 


One of the most important and fundamental results in all of real analysis
is the Hahn-Banach theorem. There are many different forms of the theorem
and in most cases any version of the theorem can be used to directly prove
any other version. It is first necessary to introduce the idea of convex sets and
convex functionals.


Definition: A set $M \subset L$ is called a convex set if for each pair of points $x$,
$y \in M$, all points on the line segment joining $x$ and $y$ (that is, all points of the
form $kx + (1 - k)y$, $0 \le k \le 1$) are also elements of $M$.


Definition: A functional $p$ defined on a real linear space $L$ is said to be convex
if it has the following properties:

1. $p(\alpha x) = \alpha p(x)$ for all $x \in L$ and all $\alpha \ge 0$;

2. $p(x + y) \le p(x) + p(y)$ for all $x, y \in L$.
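
A convex functional in this sense need not be a norm. As a hedged example chosen for illustration, $p(x) = \max_k x_k$ on $\mathbb{R}^n$ satisfies both properties above but is not symmetric, since $p(-x)$ can differ from $p(x)$. The sketch checks the two properties on a few sample vectors.

```python
def p(x):
    """Example functional on R^n: the largest coordinate of x."""
    return max(x)

def add(x, y):
    return [xi + yi for xi, yi in zip(x, y)]

def scale(a, x):
    return [a * xi for xi in x]

x, y = [1.0, -2.0], [-3.0, 0.5]
print(p(scale(3.0, x)) == 3.0 * p(x))   # positive homogeneity for alpha >= 0
print(p(add(x, y)) <= p(x) + p(y))      # subadditivity
print(p(x), p(scale(-1.0, x)))          # 1.0 and 2.0: p is not symmetric
```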


We now turn to the idea of extending a linear functional. Suppose we
have a linear functional defined on a certain subspace. We want to know whether
there exists a linear functional on the entire space that is equal to our first
functional on the subspace. The Hahn-Banach theorem tells us when this is
possible.


Theorem 2 (Hahn-Banach) Let $p$ be a finite convex functional defined on a
real linear space $L$ and let $L_0$ be any subspace of $L$. Let $f_0$ be any linear
functional on $L_0$ satisfying the condition

$f_0(x) \le p(x)$

on $L_0$. Then there exists a linear functional $f$ on $L$, called the extension of $f_0$,
such that $f = f_0$ at every point of $L_0$ and $f(x) \le p(x)$ on $L$.


Proof: We can assume that $L_0 \ne L$. Let $z$ be any element of $L - L_0$, and let
$L'$ be the subspace generated by $L_0$ and the element $z$, this being the set of all
linear combinations of the form $x + tz$ ($x \in L_0$, $t \in \mathbb{R}$). For $f$ to be an
extension of $f_0$ onto $L'$, we need

$f(x + tz) = f(x) + f(tz) = f_0(x) + t f(z)$.

Now, let $c = f(z)$ and note that if $f$ is an extension onto $L'$ satisfying the bound,
then $f_0(x) + tc \le p(x + tz)$. Dividing by $|t|$, this condition translates to the two
conditions:

$c \le p(x/t + z) - f_0(x/t)$ if $t > 0$, and $c \ge -p(-x/t - z) - f_0(x/t)$ if $t < 0$.

So what remains is to show that there is always a $c$ satisfying these conditions.
In this light, let $y_1$ and $y_2$ be elements of $L_0$. Then

$f_0(y_2) - f_0(y_1) = f_0(y_2 - y_1) \le p(y_2 - y_1)$
$= p((y_2 + z) + (-y_1 - z)) \le p(y_2 + z) + p(-y_1 - z)$.

So we get

$-f_0(y_2) + p(y_2 + z) \ge -f_0(y_1) - p(-y_1 - z)$.

Now let $c_1 = \sup_{y_1} [-f_0(y_1) - p(-y_1 - z)]$ and
$c_2 = \inf_{y_2} [-f_0(y_2) + p(y_2 + z)]$.
Then $c_2 \ge c_1$ and it simply remains to choose $c_2 \ge c \ge c_1$ and note that $c$
satisfies the necessary conditions. So the functional $f$, defined on $L'$, satisfies
the condition $f(x) \le p(x)$ for $x \in L'$. An induction argument not given here
extends the result from $L'$ to the entire space $L$.


By applying the Hahn-Banach theorem, we may show a somewhat more
useful result, given in [2].


Theorem 3 Let $f$ be a bounded linear functional defined on the subspace $L$ of
the real normed linear space $X$. Then there exists a bounded linear functional
$F$ defined on the entire space $X$ so that $F(x) = f(x)$ for $x \in L$ and $\|F\| = \|f\|$.¹


Proof: Since $f$ is a bounded linear functional, for $x \in L$ we have
$|f(x)| \le \|f\| \|x\|$. For $x \in X$ define $p(x) = \|f\| \|x\|$. It is then easy to show
that $p$ is convex and that $f(x) \le p(x)$ on $L$. By the Hahn-Banach Theorem, extend $f$
to a new functional $F$ defined on all of $X$ such that $F(x) \le p(x) = \|f\| \|x\|$ and
$F(x) = f(x)$ for $x \in L$. Applying the same bound to $-x$ gives
$-F(x) = F(-x) \le \|f\| \|x\|$, so $|F(x)| \le \|f\| \|x\|$. Clearly, then, $F$ is bounded
and $\|F\| \le \|f\|$. Similarly, if $x \in L$, then $|f(x)| = |F(x)| \le \|F\| \|x\|$,
implying $\|f\| \le \|F\|$. Combining the two inequalities, we see
that $\|F\| = \|f\|$ and the proof is complete.


¹ The norm operator $\|\cdot\|$, when applied to a bounded linear functional on a normed
linear space $X$ (as is the case here), is defined as
$\|f\| = \sup_{\|x\| \le 1} |f(x)|$. Further, $\|f\|$ can easily be shown to have the
following properties: $\|f\| = \sup_{x \ne 0} \frac{|f(x)|}{\|x\|}$ and
$|f(x)| \le \|f\| \|x\|$ for all $x \in X$.
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We now turn to perhaps the most useful corollary of the Hahn-Banach 
theorem. It is very desirable in many situations to know that there are a suf- 
ficient number of bounded linear functionals defined on a space to strictly sep- 
arate the elements of that space. By strictly separate, we mean that for any 
two elements x; and zo of a linear space X, there exists an f € X*, the set of 
bounded linear functionals on X, such that f(zi1) — f(x2) £ 0. We prove this 


in the context of the following theorem. 


Theorem 4 Let $X$ be a normed linear space and $x_0 \in X$, $x_0 \ne 0$. Then there exists an $F \in X^*$ such that $\|F\| = 1$ and $F(x_0) = \|x_0\|$.

Proof: Let $L$ be the linear subspace of $X$ generated by taking the linear span of $x_0$. All elements in $L$ will thus have a representation $\alpha x_0$, $\alpha \in \mathbb{R}$. Define the function $f$ on $L$ by $f(\alpha x_0) = \alpha\|x_0\|$. It is seen at once that $f(x_0) = \|x_0\|$ simply by taking $\alpha = 1$. We can then extend $f$ to a bounded linear functional $F$ defined on the whole space $X$ as noted in the previous theorem. Since $F = f$ on $L$, $F(x_0) = f(x_0) = \|x_0\|$. It thus remains only to show that $\|F\| = 1$. For any $x \in L$, we see that

$$|f(x)| = |f(\alpha x_0)| = |\alpha|\,\|x_0\| = \|\alpha x_0\| = \|x\|,$$

implying that $\|f\| = 1$ and therefore $\|F\| = 1$ by the previous theorem.


To prove our assertion about the strict separation of elements in a linear space by the functionals defined on that space, let $X$ be a normed linear space and $x_1$ and $x_2$ be distinct elements in $X$. Define $x_0 = x_1 - x_2$ and see that $x_0 \ne 0$ since $x_1$ and $x_2$ are distinct. We may now apply the previous theorem to obtain an $f \in X^*$ with

$$f(x_1 - x_2) = f(x_0) = \|x_0\| \ne 0.$$
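As a concrete illustration of Theorem 4 and the separation argument, the following minimal sketch works in $\mathbb{R}^3$ with the Euclidean norm, where the norming functional can be written down explicitly as $F(x) = \langle x_0/\|x_0\|, x\rangle$. The function names `norm` and `norming_functional` and the sample vectors are hypothetical choices made for this example; in a general normed space no such closed form is available and the Hahn-Banach theorem is what guarantees existence.

```python
import math

def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in x))

def norming_functional(x0):
    """Return F(x) = <x0/||x0||, x>, so that F(x0) = ||x0|| and ||F|| = 1."""
    n0 = norm(x0)
    if n0 == 0.0:
        raise ValueError("x0 must be nonzero")
    w = [c / n0 for c in x0]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

# Two distinct points and their difference x0 = x1 - x2 = [0, 3, 4].
x1 = [1.0, 2.0, 2.0]
x2 = [1.0, -1.0, -2.0]
x0 = [a - b for a, b in zip(x1, x2)]

F = norming_functional(x0)
separation = F(x1) - F(x2)  # equals F(x0) = ||x0|| by linearity
```

Here `separation` is nonzero precisely because $x_1 \ne x_2$, which is the strict separation property used later in the proof of Sandberg's theorem.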
4. The Stone-Weierstrass Theorem and Uniform Approximation


In many applications, it is desirable to know whether a certain class of $\mathbb{R}$-valued functions may be useful in uniformly approximating a larger group of $\mathbb{R}$-valued functions. Weierstrass proved that it is possible to uniformly approximate any continuous functional on a compact subset of $\mathbb{R}^n$ by a polynomial in $n$ variables. Since that time, there have been several different proofs of Weierstrass' theorem. One of the most useful is the one given by M. H. Stone in [23]. His primary result, which will be shown, generalizes Weierstrass' result in that it allows the domain to be any compact set (instead of just a compact subset of $\mathbb{R}^n$) and the set of approximating functions to be a set other than polynomials (which may not have meaning on a general compact set).

In order to generalize the theorem, we can view the polynomials as a subset of the set from which we obtain the approximating functional. We seek to know what functions may be derived from a certain set of prescribed functions by the specified algebraic operations of addition, multiplication, multiplication by real numbers and uniform passage to the limit. The set of prescribed functions for the polynomials, for example, consists of just two functions: $f_1(x) = 1$ and $f_2(x) = x$ defined on a bounded closed interval $X$ of $\mathbb{R}$. From these two functions and the algebraic operations alone, the set of all polynomials may be formed. Weierstrass' theorem then tells us that the uniform passage to the limit of this set (the polynomials) is the set of all continuous functionals on $X$. Equivalently, the set of continuous functionals is the uniform closure of the set of polynomials, or the continuous functions on $X$ may be uniformly approximated by the set of polynomials.
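To make the idea of uniform approximation by polynomials concrete, the sketch below uses Bernstein polynomials on $[0, 1]$. This is one classical constructive route to Weierstrass' theorem and is swapped in here purely for illustration; it is not the argument used by Stone or in this thesis. The target function $|x - 1/2|$, the degrees 10 and 200, and the helper names are assumptions of the example, and the supremum is only estimated on a finite grid.

```python
from math import comb

def bernstein(f, n):
    """Return the degree-n Bernstein polynomial of f on [0, 1]."""
    coeffs = [f(k / n) * comb(n, k) for k in range(n + 1)]

    def p(x):
        return sum(c * x**k * (1.0 - x) ** (n - k) for k, c in enumerate(coeffs))

    return p

def sup_error(f, p, samples=201):
    """Estimate sup |f - p| over [0, 1] on an evenly spaced grid."""
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - p(x)) for x in grid)

# A continuous but non-differentiable target function.
f = lambda x: abs(x - 0.5)

err_10 = sup_error(f, bernstein(f, 10))
err_200 = sup_error(f, bernstein(f, 200))
```

The estimated error shrinks as the degree grows, which is the behavior the uniform closure statement describes.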

In order to begin proving this generalized theorem, it is instructive to consider the case of a general topological space $X$ where the specified algebraic operations are the lattice operations $\vee$ and $\wedge$ defined to be:

$$f \vee g = \max(f, g) \quad\text{and}\quad f \wedge g = \min(f, g).$$

These form the functions $h$ and $k$ defined as:

$$h(x) = \max(f(x), g(x)) \quad\text{and}\quad k(x) = \min(f(x), g(x))$$

for any $x \in X$. Let $C$ be the set of all continuous real functions on $X$ and $C_0$ be a prescribed subfamily of $C$. We want to obtain the family $U(C_0)$ of all functions which can be formed from the functions in $C_0$ by the application of the specified algebraic operations and uniform passage to the limit. In the case of the lattice operations, it is easily observed that $U(C_0)$ is a part of $C$ closed under uniform passage to the limit, that is

$$U(C_0) \subset C, \qquad U(U(C_0)) = U(C_0).$$

The first property may be shown by observing that the mappings

$$x \mapsto \max(f(x), g(x)) \quad\text{and}\quad x \mapsto \min(f(x), g(x))$$

are continuous. This follows from the continuity of $f$ and $g$ (necessarily true since $C_0$ is a subfamily of $C$) and the continuity of the max and min mappings. Now since the uniform limit of continuous functions is also a continuous function, clearly $U(C_0) \subset C$. To show that $U(U(C_0)) = U(C_0)$, we can form $U(C_0)$ in two steps. First, let $U_1(C_0)$ be the set containing all the functions obtained by applying the lattice operations alone to the functions in $C_0$. Then let $U_2(C_0)$ be the set consisting of the functions obtained from those in $U_1(C_0)$ by uniform passage to the limit. Clearly,

$$C_0 \subset U_1(C_0) \subset U_2(C_0) \subset U(C_0).$$

It remains to show that $U_2(C_0)$ is closed under the allowable operations, and therefore $U_2(C_0) = U(C_0)$. Let $f$ be a function which is a uniform limit of functions $f_n$ in $U_2(C_0)$. Then $f$ must also be in $U_2(C_0)$ since given $\varepsilon > 0$, there exists a function $g_n$ in $U_1(C_0)$ such that $|f_n - g_n| < \varepsilon/2$, since $U_2(C_0)$ is, by definition, the set of functions obtained by passing those in $U_1(C_0)$ to a uniform limit. Also, $|f - f_n| < \varepsilon/2$ for $n$ sufficiently large, since $f$ was defined as a uniform limit of the $f_n$. Therefore, $|f - g_n| < \varepsilon$ and $f$ is a uniform limit of functions $g_n$ in $U_1(C_0)$ and therefore a member of $U_2(C_0)$. We must now show that whenever $f$ and $g$ are in $U_2(C_0)$, then so are $f \vee g$ and $f \wedge g$. This can be done by observing that if $f$ and $g$ are uniform limits of functions $f_n$ and $g_n$ in $U_1(C_0)$, then $f \vee g$ and $f \wedge g$ are uniform limits of $f_n \vee g_n$ and $f_n \wedge g_n$, respectively.
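The last observation rests on the elementary bound $|\max(a,b) - \max(c,d)| \le \max(|a-c|, |b-d|)$, and likewise for min, which shows that the lattice operations do not enlarge uniform distances. The sketch below checks this numerically on a grid. The functions $\sin$ and $\cos$, the perturbations of size $1/n$ and $0.5/n$, and the helper names are hypothetical choices for the example.

```python
import math

def join(f, g):
    """Pointwise maximum, the lattice operation f v g."""
    return lambda x: max(f(x), g(x))

def meet(f, g):
    """Pointwise minimum, the lattice operation f ^ g."""
    return lambda x: min(f(x), g(x))

def sup_dist(f, g, grid):
    """Largest pointwise distance between f and g on the grid."""
    return max(abs(f(x) - g(x)) for x in grid)

grid = [i / 500 for i in range(501)]
f, g = math.sin, math.cos

n = 50
f_n = lambda x: math.sin(x) + 1.0 / n  # uniformly within 1/n of f
g_n = lambda x: math.cos(x) - 0.5 / n  # uniformly within 0.5/n of g

bound = max(sup_dist(f, f_n, grid), sup_dist(g, g_n, grid))
join_err = sup_dist(join(f, g), join(f_n, g_n), grid)
meet_err = sup_dist(meet(f, g), meet(f_n, g_n), grid)
```

Both `join_err` and `meet_err` stay below `bound`, so letting $n$ grow forces $f_n \vee g_n$ and $f_n \wedge g_n$ to converge uniformly as well.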


Theorem 5 Let $X$ be a compact space, $C$ the family of all continuous real functions on $X$, $C_0$ an arbitrary subfamily of $C$, and $U(C_0)$ the family of all functions (necessarily continuous) generated from $C_0$ by the lattice operations and uniform passage to the limit. Then a necessary and sufficient condition for a function $f$ in $C$ to be in $U(C_0)$ is that, whatever the points $x, y \in X$ and whatever the positive number $\varepsilon$, there exists a function $f_{xy}$ obtained by applying the lattice operations alone to $C_0$ and such that

$$|f(x) - f_{xy}(x)| < \varepsilon \quad\text{and}\quad |f(y) - f_{xy}(y)| < \varepsilon.$$

Proof: The necessity is obvious. A proof of the sufficiency, which is not complicated, is given in [23]. There, Stone also notes the following corollary to the theorem.

Corollary 1: If $C_0$ has the property that, whatever the points $x, y \in X$, $x \ne y$, and whatever the real numbers $\alpha$ and $\beta$, there exists a function $f_0$ in $C_0$ for which $f_0(x) = \alpha$ and $f_0(y) = \beta$, then $U(C_0) = C$.

This tells us that the way in which a function $f$ acts on pairs of points in $X$ determines whether it can be approximated by $U(C_0)$. This observation leads to the following theorem.


Theorem 6 Let $X$ be a compact space, $C$ the family of all continuous (necessarily bounded) real functions on $X$, $C_0$ an arbitrary subfamily of $C$ and $U(C_0)$ the family of all functions (necessarily continuous) generated from $C_0$ by the linear lattice operations and uniform passage to the limit. Then a necessary and sufficient condition for a function $f$ in $C$ to be in $U(C_0)$ is that $f$ satisfy every linear relation of the form $\alpha g(x) = \beta g(y)$, $\alpha\beta \ge 0$, which is satisfied by all functions in $C_0$. The linear relations associated with an arbitrary pair of points $x, y$ in $X$ must be equivalent to one of the following distinct types:

1. $g(x) = 0$ and $g(y) = 0$;

2. $g(x) = 0$ and $g(y)$ unrestricted, or vice versa;

3. $g(x) = g(y)$ without restriction on the common value;

4. $g(x) = \lambda g(y)$ or $g(y) = \lambda g(x)$ for a unique value $\lambda$, $0 < \lambda < 1$.


Corollary 1: In order that $U(C_0)$ contain a nonvanishing constant function, it is necessary and sufficient that the only linear relations of the form $\alpha g(x) = \beta g(y)$, $\alpha\beta \ge 0$, satisfied by every function in $C_0$ be those reducible to the form $g(x) = g(y)$.

Proof: It is obvious that when $U(C_0)$ contains a nonvanishing constant function then conditions (1), (2), and (4) can never be satisfied, so only (3) must be considered.

Corollary 2: In order that $U(C_0) = C$, it is sufficient that the functions in $C_0$ satisfy no linear relation of the form (1)-(4) of Theorem 6.

This is an important corollary because in practice it is easy to consider a set of functions such that none of the relations (1)-(4) is satisfied by every function in the set.


Definition: A family of arbitrary functions on a domain $X$ is said to be a separating family (for that domain) if, whenever $x$ and $y$ are distinct points of $X$, there is some function $f$ in the family with distinct values $f(x)$, $f(y)$ at these points.

Corollary 3: If $X$ is compact and if $C_0$ is a separating family for $X$ and contains a nonvanishing constant function, then $U(C_0) = C$.

Proof: Since $C_0$ contains a nonvanishing constant function, it may satisfy only condition (3) of Theorem 6. However, since $C_0$ is a separating family, for any distinct $x, y$ in $X$ there is a function $f \in C_0$ such that $f(x) \ne f(y)$. So condition (3) is not satisfied by all functions in $C_0$. Therefore none of the conditions are satisfied by $C_0$ and therefore $U(C_0) = C$.


We now consider the case where $U(C_0)$ is built from the functions in $C_0 \subset C$ using the operations of addition, multiplication, multiplication by real numbers (the linear ring operations), and uniform passage to the limit. If $f$ and $g$ are uniform limits of the sequences $f_n$ and $g_n$ respectively, the product $fg$ is not in general the uniform limit of the sequence $f_n g_n$ unless the functions involved are bounded. We therefore require that the set $C$ consist of the bounded continuous functions on $X$. Of course, this is satisfied automatically when $X$ is compact. This leads to the general theorem.


Theorem 7 Let $X$ be a compact space, $C$ the family of all continuous (necessarily bounded) functions on $X$, $C_0$ an arbitrary subfamily of $C$ and $U(C_0)$ the family of all functions generated from $C_0$ by the linear ring operations and uniform passage to the limit. Then a necessary and sufficient condition for a function $f$ in $C$ to be in $U(C_0)$ is that $f$ satisfy every linear relation of the form $g(x) = 0$ or $g(x) = g(y)$ which is satisfied by all functions in $C_0$.

Proof: As a lemma, one can show (see [23]) that if $f$ is in $U(C_0)$ then so is $|f|$. This means that $|f|$ is the uniform limit of functions obtained from $C_0$ by the linear ring operations. Using a well known representation of the min and max functions:

$$\max(a, b) = \tfrac{1}{2}(a + b + |a - b|)$$
$$\min(a, b) = \tfrac{1}{2}(a + b - |a - b|)$$

we can now see that whenever $f$ and $g$ are in $U(C_0)$ then $f \vee g$ and $f \wedge g$ are in $U(C_0)$ as well. So $U(C_0)$ is closed under the linear lattice operations as well as the linear ring operations and uniform passage to the limit. Therefore the results in Theorem 6 are applicable here. It remains to show that the functions in $U(C_0)$ cannot all satisfy a linear relation of the form given in condition (4) of Theorem 6. Assume $g(x) = \lambda g(y)$ for every function $g$ in $U(C_0)$, for some pair $x, y$ in $X$ and some $0 < \lambda < 1$. Then for every $f$ in $U(C_0)$, $f^2$ is also in $U(C_0)$ and the relations $f^2(x) = \lambda f^2(y)$ and $\lambda f^2(y) = \lambda^2 f^2(y)$ would hold, implying that either $f(y) = 0$ for every $f$ in $U(C_0)$ or $\lambda = 0, 1$, the second being a contradiction to the assumption. So we conclude that $f$ is in $U(C_0)$ if and only if it satisfies all relations of the form $g(x) = 0$ or $g(x) = g(y)$ satisfied by those functions in $C_0$.
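The representation of max and min through the absolute value is what lets the ring operations reproduce the lattice operations in the proof above. A minimal numerical check of the two identities follows; the function names and sample pairs are hypothetical choices for this example, and the pairs are picked so that the floating-point arithmetic is exact.

```python
def max_via_abs(a, b):
    """max(a, b) written with addition, scaling, and absolute value only."""
    return 0.5 * (a + b + abs(a - b))

def min_via_abs(a, b):
    """min(a, b) written with addition, scaling, and absolute value only."""
    return 0.5 * (a + b - abs(a - b))

pairs = [(3.0, -1.0), (-2.5, 4.0), (1.0, 1.0), (0.0, -7.0)]
max_matches = all(max_via_abs(a, b) == max(a, b) for a, b in pairs)
min_matches = all(min_via_abs(a, b) == min(a, b) for a, b in pairs)
```

Applied pointwise to functions, the same formulas give $f \vee g = \tfrac{1}{2}(f + g + |f - g|)$ and $f \wedge g = \tfrac{1}{2}(f + g - |f - g|)$.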


We give a definition in order to restate the general theorem.

Definition: A family $\mathcal{A}$ of real functions defined on a set $X$ is said to be an algebra if (i) $f + g \in \mathcal{A}$, (ii) $fg \in \mathcal{A}$, and (iii) $cf \in \mathcal{A}$ for all $f \in \mathcal{A}$, $g \in \mathcal{A}$ and for all real constants $c$, that is, if $\mathcal{A}$ is closed under addition, multiplication, and multiplication by real numbers.

An equivalent form of the general theorem that is often used in practice is stated in [19] as follows:

Theorem 8 Let $\mathcal{A}$ be an algebra of real continuous functions on a compact set $K$. If $\mathcal{A}$ separates points of $K$ and if $\mathcal{A}$ does not vanish at any point in $K$, then any real continuous function on $K$ may be uniformly approximated by an element of $\mathcal{A}$.


An argument in [4] extends the theorem to certain normed linear spaces that are not necessarily compact.

Theorem 9 Let $X$ be a normed linear space (or, indeed, any Hausdorff topological space). If $\mathcal{A}$ is a subalgebra of $C(X)$, the continuous functions on $X$, that contains constants and separates the points of $X$, then $\mathcal{A}$ is dense in $C(X)$.

Proof: Let $f$ be any element of $C(X)$. We must prove that each neighborhood of $f$ contains an element of $\mathcal{A}$. Let $K$ be a compact set in $X$ and $\varepsilon$ a positive number. By restricting $f$ and all members of $\mathcal{A}$ to the compact set $K$, we can apply the classical version of the Stone-Weierstrass Theorem in $C(K)$. Its conclusion is that the set

$$\{\, g|_K : g \in \mathcal{A} \,\}$$

is dense in $C(K)$. Hence there is an element $g$ in $\mathcal{A}$ such that $\|f - g\|_K < \varepsilon$.

Now we give some examples from Stone's original article.


Theorem 10 Let $X$ be an arbitrary bounded closed subset of $n$-dimensional Cartesian space, the coordinates of a general point being $x_1, \ldots, x_n$. Any continuous real function $f$ defined on $X$ can be uniformly approximated by polynomials in the variables $x_1, \ldots, x_n$. In case the origin $x = (0, \ldots, 0)$ belongs to $X$, the function $f$ can be uniformly approximated by polynomials vanishing at the origin if and only if $f$ itself vanishes at the origin. Otherwise $f$ can be uniformly approximated by such polynomials without qualification.

This is the classical approximation theorem proved by Weierstrass.


Theorem 11 Let $f$ be an arbitrary continuous real function of the real variable $\theta$, $0 \le \theta \le 2\pi$, subject to the periodicity condition $f(0) = f(2\pi)$. Then $f$ can be uniformly approximated on its domain of definition by trigonometric polynomials of the form

$$p(\theta) = \tfrac{1}{2}a_0 + \sum_{n=1}^{N} (a_n \cos n\theta + b_n \sin n\theta).$$

Theorem 12 Any continuous real function $f$, which is defined on the interval $0 \le x < \infty$ and vanishes at infinity in the sense that $\lim_{x \to \infty} f(x) = 0$, can be uniformly approximated by functions of the form $e^{-x}p(x)$ where $p(x)$ is a polynomial.

Theorem 13 Any continuous real function $f$ which is defined on the interval $-\infty < x < +\infty$ and which vanishes at infinity in the sense that

$$\lim_{x \to -\infty} f(x) = \lim_{x \to +\infty} f(x) = 0$$

can be uniformly approximated by functions of the form $e^{-x^2}p(x)$ where $p(x)$ is a polynomial.


Several of these examples will prove useful shortly. 
5. Neural Network Approximation of Continuous Maps


We now examine a structure that has proven useful for approximation. The structure is based almost entirely on a proof in [20]. We assume that we have a normed linear space $X$ and a subset $C$ that is nonempty and compact. We let $X^*$ represent the set of bounded linear functionals on $X$ and $Y$ represent a set of continuous maps which are dense in $X^*$ on $C$ in the usual sense. That is, for each $\phi \in X^*$ and each $\varepsilon > 0$, there exists a $y \in Y$ such that $|\phi(x) - y(x)| < \varepsilon$ for $x \in C$. Further, for $k = 1, 2, 3, \ldots$ we let $D_k$ be any family of continuous maps $h : \mathbb{R}^k \to \mathbb{R}$ such that given a compact $E \subset \mathbb{R}^k$ and any continuous $g : E \to \mathbb{R}$ as well as $\sigma > 0$ there exists an $h \in D_k$ such that $|g(x) - h(x)| < \sigma$ for $x \in E$. Let $U$ be any set of continuous maps $u : \mathbb{R} \to \mathbb{R}$ such that given $\sigma > 0$ and any bounded interval $(\beta_1, \beta_2) \subset \mathbb{R}$ there exists a finite number of elements $u_1, \ldots, u_\ell$ of $U$ for which $|\exp(\beta) - \sum_j u_j(\beta)| < \sigma$ for $\beta \in (\beta_1, \beta_2)$.


Theorem 14 (Sandberg) Let $f : C \to \mathbb{R}$. Then the following conditions are equivalent.

(i) $f$ is continuous.

(ii) Given $\varepsilon > 0$ there are a positive integer $k$, real numbers $c_1, \ldots, c_k$, elements $u_1, \ldots, u_k$ of $U$, and elements $y_1, \ldots, y_k$ of $Y$ such that

$$\Big| f(x) - \sum_{j=1}^{k} c_j u_j[y_j(x)] \Big| < \varepsilon$$

for $x \in C$.

(iii) Given $\varepsilon > 0$ there are a positive integer $k$, elements $y_1, \ldots, y_k$ of $Y$, and an $h \in D_k$ such that

$$| f(x) - h[y_1(x), \ldots, y_k(x)] | < \varepsilon$$

for $x \in C$.


Proof: First, assume condition (i) holds. Let $V$ be the set of all functions $v : C \to \mathbb{R}$ such that

$$v(x) = \sum_j a_j \exp(\phi_j(x)),$$

in which the sum is finite and $a_j \in \mathbb{R}$ and $\phi_j \in X^*$. To see that $V$ constitutes an algebra as defined above, observe that

$$\exp(\phi(x)) \exp(\psi(x)) = \exp(\phi(x) + \psi(x)) = \exp((\phi + \psi)(x)).$$

Taking $\phi = 0$ we can see that $V$ contains constants. Finally, we have demonstrated previously that the Hahn-Banach theorem guarantees that for any distinct $x$ and $y$ in $C$ we can choose a $\phi \in X^*$ such that $\phi(x - y) \ne 0$. Therefore, $\exp(\phi(x)) \ne \exp(\phi(y))$, so $V$ separates the points of $C$. We may now apply the Stone-Weierstrass theorem guaranteeing uniform approximation on compacta. In other words, for $\varepsilon > 0$, there are a positive integer $n$, real numbers $d_1, \ldots, d_n$, and elements $z_1, \ldots, z_n$ of $X^*$ such that

$$\Big| f(x) - \sum_j d_j \exp(z_j(x)) \Big| < \frac{\varepsilon}{3}$$

for $x \in C$.

Assume that $\sum_j |d_j| \ne 0$. Choose $\gamma > 0$ such that $\gamma \sum_j |d_j| < \varepsilon/3$. Let $[a, b]$ be an interval in $\mathbb{R}$ that contains all of the sets $z_j(C)$, and let $\tilde{a} \in \mathbb{R}$ and $\tilde{b} \in \mathbb{R}$ be such that $\tilde{a} < a$ and $\tilde{b} > b$. That is, the interval $[\tilde{a}, \tilde{b}]$ contains the interval $[a, b]$. Now, choose $\nu > 0$ such that $|\exp(\beta_1) - \exp(\beta_2)| < \gamma$ for $\beta_1, \beta_2 \in [\tilde{a}, \tilde{b}]$ with $|\beta_1 - \beta_2| < \nu$. Clearly this is possible because of the continuity of the exponential function. Set $\rho = \min(\nu, a - \tilde{a}, \tilde{b} - b)$ and choose $y_j \in Y$ such that $|z_j(x) - y_j(x)| < \rho$, $x \in C$, for all $j$. In particular each $y_j(x)$ lies in $[\tilde{a}, \tilde{b}]$. This gives $|\exp(z_j(x)) - \exp(y_j(x))| < \gamma$, $x \in C$, for each $j$. Now using a version of the triangle inequality, this gives:

$$\Big| f(x) - \sum_j d_j \exp(y_j(x)) \Big| \le \Big| f(x) - \sum_j d_j \exp(z_j(x)) \Big| + \Big| \sum_j d_j \big[\exp(z_j(x)) - \exp(y_j(x))\big] \Big|$$
$$< \varepsilon/3 + \sum_j |d_j|\, |\exp(z_j(x)) - \exp(y_j(x))| < 2\varepsilon/3,$$

for $x \in C$.

Now we choose $u_1, \ldots, u_\ell \in U$ so that

$$\Big| \exp(\beta) - \sum_i u_i(\beta) \Big| < \eta, \quad \beta \in [\tilde{a}, \tilde{b}],$$

where $\eta \sum_j |d_j| < \varepsilon/3$. Then,

$$\Big| f(x) - \sum_j \sum_i d_j u_i[y_j(x)] \Big| \le \Big| f(x) - \sum_j d_j \exp[y_j(x)] \Big| + \Big| \sum_j d_j \exp[y_j(x)] - \sum_j d_j \sum_i u_i[y_j(x)] \Big|$$
$$< 2\varepsilon/3 + \sum_j |d_j| \Big| \exp[y_j(x)] - \sum_i u_i[y_j(x)] \Big| \le 2\varepsilon/3 + \eta \sum_j |d_j| < \varepsilon.$$

Now, since $\sum_j \sum_i d_j u_i[y_j(x)]$ is equivalent to $\sum_j c_j u_j[y_j(x)]$, with the $c_j$, $u_j$, and $y_j$ in $\mathbb{R}$, $U$, and $Y$, respectively, we have shown that (i) $\Rightarrow$ (ii).


Figure 3: A general structure for approximation


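To show that (ii) $\Rightarrow$ (iii), let $\varepsilon > 0$ and suppose that there exist $k$, $c_1, \ldots, c_k$, $u_1, \ldots, u_k$, and $y_1, \ldots, y_k$ such that

$$\Big| f(x) - \sum_j c_j u_j[y_j(x)] \Big| < \varepsilon/2, \quad x \in C.$$

Let $[a, b]$ be an interval containing all of the sets $y_j(C)$, and let $h \in D_k$ satisfy $|h(\lambda) - \sum_j c_j u_j(\lambda_j)| < \varepsilon/2$ for $\lambda \in [a, b]^k$. Then

$$\big| f(x) - h[y_1(x), \ldots, y_k(x)] \big| \le \Big| f(x) - \sum_j c_j u_j[y_j(x)] \Big| + \Big| \sum_j c_j u_j[y_j(x)] - h[y_1(x), \ldots, y_k(x)] \Big| < \varepsilon/2 + \varepsilon/2 = \varepsilon$$

for $x \in C$.

Finally, (iii) $\Rightarrow$ (i) as $f$ is a uniform limit of continuous functions and therefore continuous itself.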
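The form in condition (ii), $x \mapsto \sum_j c_j u_j[y_j(x)]$, can be sketched as a small program. The example below is a minimal illustration under stated assumptions: $X = \mathbb{R}^2$, each $y_j$ is an exact linear functional, each $u_j$ is $\exp$, and the target $\cosh(x_1 + 2x_2)$ is chosen because it is exactly a two-term exponential sum. The names `linear_functional` and `network` are hypothetical and do not come from [20].

```python
import math

def linear_functional(w):
    """Return the linear functional y(x) = <w, x> on R^n."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

def network(c, u, y):
    """Return the map x -> sum_j c[j] * u[j](y[j](x))."""
    def f(x):
        return sum(cj * uj(yj(x)) for cj, uj, yj in zip(c, u, y))
    return f

# cosh(t) = 0.5*exp(t) + 0.5*exp(-t), with t = x1 + 2*x2.
c = [0.5, 0.5]
u = [math.exp, math.exp]
y = [linear_functional([1.0, 2.0]), linear_functional([-1.0, -2.0])]
net = network(c, u, y)

x = [0.3, -0.1]
target = math.cosh(x[0] + 2.0 * x[1])
```

For a general continuous $f$ the theorem only asserts that suitable $k$, $c_j$, and $y_j$ exist for each $\varepsilon$; it does not say how to find them.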

This proof has demonstrated a general structure that may be used for approximation. This structure is shown in Figure 3.

Part (iii) of the theorem shows that the $y_j$'s are simply functions which are capable of approximating linear functionals defined on the space $X$ (these may actually be linear functionals themselves) while the structure for $h$ is simply a continuous memoryless nonlinear system capable of approximating uniformly on compacta in $\mathbb{R}^k$. In other words, the problem of approximating a function whose domain may be any compact subset of any normed linear space has been reduced to the problem of approximating a function on $\mathbb{R}^k$, a subject about which a great deal is known, and which has been treated to some extent in dealing with the Stone-Weierstrass theorem. Stiles, Sandberg, and Ghosh have shown in [22] that structures of a similar form have use in the approximation of certain nonlinear discrete time mappings as well.

Part (ii) of the theorem gives a specific example of the structure of the network. Again it takes the $y_j$'s to be uniform approximations of linear functionals on $X$. Here one possible structure for $h$ is shown in Figure 4. The $u_j$'s, as mentioned before, are drawn from a set capable of uniform approximation of the exponential function on a bounded set in $\mathbb{R}$. In the simplest case, from the perspective of the theorem, each $u_j$ may be taken to be the function $\exp(\cdot)$.

In a moment we will determine possible choices for the elements $u_j$ in the approximation network. Now we will look at a similar method of dealing with this problem given in [4], [7], and [24]. We start by defining a certain class of functions, called ridge functions, and then immediately give the theorem.


Figure 4: A structure for h

Definition: A function $f : X \to \mathbb{R}$ is called a ridge function if it may be represented in the form $f = g \circ \phi$, where $g : \mathbb{R} \to \mathbb{R}$ and $\phi \in X^*$, where $X^*$ is the space of continuous linear functionals on $X$. An alternative equivalent form of this composite function is $f(x) = g(\phi(x))$ for $x \in X$.

It can easily be shown, for example, that all ridge functions on $\mathbb{R}^n$ can be written in the form

$$f(x) = g(a_1 x_1 + a_2 x_2 + \cdots + a_n x_n)$$

where $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$.
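A ridge function on $\mathbb{R}^n$ depends on $x$ only through the single number $a_1 x_1 + \cdots + a_n x_n$, so it is constant along every direction orthogonal to $a$. The short sketch below illustrates this; the profile $\tanh$, the vector `a`, and the helper name `ridge` are hypothetical choices made for the example.

```python
import math

def ridge(g, a):
    """Return the ridge function x -> g(a_1*x_1 + ... + a_n*x_n)."""
    return lambda x: g(sum(ai * xi for ai, xi in zip(a, x)))

a = [1.0, 2.0, -1.0]
f = ridge(math.tanh, a)

x = [0.5, 0.5, 0.25]
d = [2.0, -1.0, 0.0]  # orthogonal to a: 1*2 + 2*(-1) + (-1)*0 = 0
shifted = [xi + 3.0 * di for xi, di in zip(x, d)]

value_at_x = f(x)
value_at_shifted = f(shifted)  # same value, since a . shifted = a . x
```

This is the "ridge" shape the name refers to: the graph of $f$ is flat along the hyperplanes $a \cdot x = \text{const}$.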


Theorem 15 (Cheney) Let $G$ be a fundamental set in $C(\mathbb{R})$² and let $X$ be a normed linear space. Let $\Phi$ be a subset of $X^*$ such that the set

$$\{\, \phi/\|\phi\| : \phi \in \Phi,\ \phi \ne 0 \,\}$$

is dense in the unit sphere of $X^*$. Then the set of ridge functions $\{\, g \circ \phi : g \in G,\ \phi \in \Phi \,\}$ is fundamental in $C(X)$.³

² A subset $Y$ of $X$ is said to be fundamental in $X$ if its linear span is dense in $X$. Thus, for any $x \in X$ and $\varepsilon > 0$ there are elements $y_1, \ldots, y_n \in Y$ and $c_j \in \mathbb{R}$ such that $\| x - \sum_{j=1}^{n} c_j y_j \| < \varepsilon$.

³ $C(X)$ is, of course, the set of continuous, real-valued functions on the normed linear space $X$.

Proof: Let $f$ be a member of $C(X)$, $C$ a compact set in $X$, and $\varepsilon > 0$. We have shown above that there exist $u_j \in C(\mathbb{R})$ and $y_j \in X^*$ such that

$$\Big| f(x) - \sum_{j=1}^{P} u_j(y_j(x)) \Big| < \frac{\varepsilon}{3}$$

for $x \in C$. By adjusting the functions $u_j$ as necessary, we can assume that $\|y_j\| = 1$ for $1 \le j \le P$. Let $M = \sup_{x \in C} \|x\|$. Choose $\delta > 0$ so that when $|s| \le M$, $|t| \le M$, and $|s - t| < \delta$ we get $|u_j(s) - u_j(t)| < \varepsilon/3P$ for $1 \le j \le P$. This is, of course, possible because the $u_j$ are continuous. Now select $\phi_j \in \Phi$ so that $\| \phi_j/\|\phi_j\| - y_j \| < \delta/M$ for $1 \le j \le P$. Let $\lambda_j = 1/\|\phi_j\|$ and $\mu = \max_j \|\phi_j\|$. Select $a_{jk} \in \mathbb{R}$ and $g_{jk} \in G$ so that for $|t| \le \mu M$ we have

$$\Big| u_j(\lambda_j t) - \sum_{k=1}^{N} a_{jk}\, g_{jk}(t) \Big| < \varepsilon/3P \qquad (1 \le j \le P).$$

Now let $x \in C$. Then $\|x\| \le M$, $|y_j(x)| \le M$, $|\lambda_j \phi_j(x)| \le M$, and

$$|y_j(x) - \lambda_j \phi_j(x)| \le \|x\|\, \|y_j - \lambda_j \phi_j\| < M(\delta/M) = \delta.$$

From the definition of $\delta$ (i.e., let $s = y_j(x)$ and $t = \lambda_j \phi_j(x)$) we get

$$\Big| \sum_{j=1}^{P} u_j(y_j(x)) - \sum_{j=1}^{P} u_j(\lambda_j \phi_j(x)) \Big| < P \cdot \frac{\varepsilon}{3P} = \frac{\varepsilon}{3}.$$

Now, because $|\phi_j(x)| \le \|\phi_j\|\,\|x\| \le \mu M$, the definition of $a_{jk}$ and $g_{jk}$ gives

$$\Big| \sum_{j=1}^{P} u_j(\lambda_j \phi_j(x)) - \sum_{j=1}^{P} \sum_{k=1}^{N} a_{jk}\, g_{jk}(\phi_j(x)) \Big| < \frac{\varepsilon}{3}.$$

Now, by a simple application of the triangle inequality, we get

$$\Big| f(x) - \sum_{j=1}^{P} \sum_{k=1}^{N} a_{jk}\, g_{jk}(\phi_j(x)) \Big| \le \Big| f(x) - \sum_{j=1}^{P} u_j(y_j(x)) \Big| + \Big| \sum_{j=1}^{P} u_j(y_j(x)) - \sum_{j=1}^{P} u_j(\lambda_j \phi_j(x)) \Big| + \Big| \sum_{j=1}^{P} u_j(\lambda_j \phi_j(x)) - \sum_{j=1}^{P} \sum_{k=1}^{N} a_{jk}\, g_{jk}(\phi_j(x)) \Big| < \varepsilon.$$

Since $\sum_j \sum_k a_{jk}\, g_{jk}(\phi_j(x))$ may be written as $\sum_j c_j g_j(\phi_j(x))$, we get the desired result:

$$\Big| f(x) - \sum_j c_j g_j(\phi_j(x)) \Big| < \varepsilon \quad \text{for } x \in C.$$

We note many similarities between this proof and part of Sandberg's. The set of functions $G$ in Cheney's theorem is similar to the set of functions $U$ in Sandberg's, but the requirement in Sandberg's theorem on $U$ is less stringent. The set $U$ is required only to approximate one specific function in $C(\mathbb{R})$, namely the exponential function, $\exp(\cdot)$, on a certain bounded set. Cheney's theorem, on the other hand, requires that the set $G$ be fundamental in $C(\mathbb{R})$. This means that any continuous function defined on a compact set in $\mathbb{R}$ is capable of being approximated by linear combinations of elements of the set $G$.
6. Approximation and Classification


As previously mentioned, the problem of classifying signals plays an important role in a variety of problems. We attempt to provide the framework for a solution to some of these problems by restating the problem in a more mathematical sense.

We assume first that all of the signals to be classified are drawn from a normed linear space. For simplicity, we will further assume that each signal may belong only to one of the classes. For example, assume that there are $n$ different classes $C_1, \ldots, C_n$ that are all subsets of a normed linear space $X$, and that each signal received must necessarily belong to exactly one of the classes.

We now have the framework whereby we can view the classifier as a mathematical function $f$ that takes the signal to be classified as input and produces the desired class as output. For example, if $x \in C_j$, then $f(x) = a_j$, where $a_1, \ldots, a_n$ are all distinct integers, would model a classification system whereby each element of class $C_j$ is mapped to the integer $a_j$. A graph of this simple function is shown in Figure 5. Our assumption that each signal may belong to only one class means that the sets $C_j$ are pairwise disjoint.

Figure 5: Representation of a classifying function

In order to apply the theorems that we have developed, it is helpful to assume that the sets $C_j$ are compact. This assumption will, of course, exclude certain classification problems from the scope of these theorems. We now can let $C = \bigcup_j C_j$. The set $C$ will now also be compact as it is the union of a finite number of compact sets. Finally, since the function $f$ is constant on each set $C_j$ and the distance between any pair of these disjoint compact sets is positive, the function $f$ is continuous. With these assumptions, we get the following:


1. Given $\varepsilon > 0$, there are a positive integer $k$, real numbers $c_1, \ldots, c_k$, elements $y_1, \ldots, y_k \in Y$, and elements $u_1, \ldots, u_k$ of $U$ such that

$$a_j - \varepsilon < \sum_{i=1}^{k} c_i u_i[y_i(x)] < a_j + \varepsilon$$

for $x \in C_j$ and $j = 1, \ldots, n$.

2. Given $\varepsilon > 0$, there are a positive integer $k$, elements $y_1, \ldots, y_k$ of $Y$ and an $h \in D_k$ such that

$$a_j - \varepsilon < h[y_1(x), \ldots, y_k(x)] < a_j + \varepsilon$$

for $x \in C_j$ and $j = 1, \ldots, n$.

These follow directly from Sandberg's theorem.

Figure 6: A classifying network


This now allows us to use the above approximation network for the purpose of classification. We require one additional element, a quantizer $Q$. This quantizer is simply a real function $Q : \mathbb{R} \to \mathbb{R}$ such that $Q$ maps numbers in the interval $(a_j - \varepsilon, a_j + \varepsilon)$ to $a_j$. As long as we choose $\varepsilon < 0.5 \min_{i \ne j} |a_i - a_j|$, these intervals are disjoint and the quantizer, when following a network of the structure defined above, will allow the correct class to be output. This gives an entire structure for a classification network. It is shown in Figure 6. The structure for $h$ as defined in part (ii) of Sandberg's theorem is used in the figure.
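One simple way to realize such a quantizer is to round a network output to the nearest class label. The sketch below is a hypothetical realization, not the construction used in the thesis: the names `make_quantizer` and `max_tolerance`, the labels, and the sample network outputs are all assumptions made for illustration. A nearest-label rule agrees with $Q$ on every interval $(a_j - \varepsilon, a_j + \varepsilon)$ whenever $\varepsilon$ is below the stated bound.

```python
def max_tolerance(labels):
    """Return 0.5 * min |a_i - a_j| over distinct labels, the bound on epsilon."""
    return 0.5 * min(
        abs(a - b) for i, a in enumerate(labels) for b in labels[i + 1:]
    )

def make_quantizer(labels):
    """Return Q mapping a real number to the nearest class label."""
    return lambda value: min(labels, key=lambda a: abs(value - a))

labels = [1, 2, 4]                   # distinct integers a_1, a_2, a_3
eps_bound = max_tolerance(labels)    # half the smallest gap between labels
Q = make_quantizer(labels)

# Pretend network outputs, each within eps_bound of its true label.
outputs = [Q(1.3), Q(2.49), Q(3.6)]
```

With a network accurate to within any $\varepsilon$ smaller than `eps_bound`, the composite of the network and `Q` reproduces the classifying function $f$ exactly on $C$.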

We now turn to demonstrating some acceptable choices for the hidden elements in our classification network. In all cases, the complete structure of the network is as in Figure 6. No assumption is made about how many hidden elements are necessary or about the determination of the constants $c_j$. We are concerned entirely with determining suitable choices for the $u_j$ and give several examples, with a justification for each. In each case, the $y_j$ will be assumed to be either bounded linear functionals on $X$ or elements capable of uniformly approximating them.


Polynomial Networks 


A polynomial network is simply one in which each $u_j$ is a polynomial. In the ridge function form, a polynomial network will be of the form

$$\sum_{i=1}^{n} p_i \circ \phi_i$$

where each $p_i$ is a polynomial and each $\phi_i$ is in $X^*$.

The original Weierstrass Approximation theorem showed that polynomials are capable of uniformly approximating continuous functions on compact subsets of $\mathbb{R}$. Now, either Theorem 14 or Theorem 15 tells us that polynomials, when placed in the network, are capable of solving the classification problem.


Exponential Networks 


An exponential network, in which each of the elements $u_j$ is of the form $\exp(\cdot)$, is the easiest to justify. The proof of Sandberg's theorem first shows that the exponential function can be used as the nonlinear element and then shows that any function capable of uniformly approximating it on a bounded interval is also acceptable.


Continuous Sigmoidal Networks 


A more complicated but extremely important type of network that is useful for classification is a continuous sigmoidal network. It is first necessary to define a sigmoidal function.

Definition: A function $\sigma : \mathbb{R} \to \mathbb{R}$ is called a sigmoidal function or sigmoid if

$$\lim_{t \to -\infty} \sigma(t) = 0 \quad\text{and}\quad \lim_{t \to +\infty} \sigma(t) = 1.$$

In 1989, Cybenko (see [8]) proved that for any compact set $C \subset \mathbb{R}^n$, any $f \in C(C)$, and any $\varepsilon > 0$ there exists a function $g$ of the form

$$g(x) = \sum_{j=1}^{N} a_j \sigma(\langle y_j, x \rangle + \theta_j) \qquad (x, y_j \in \mathbb{R}^n,\ a_j, \theta_j \in \mathbb{R})$$

where $\sigma$ is a continuous sigmoidal function, such that

$$|g(x) - f(x)| < \varepsilon \quad \text{for all } x \in C.$$

In other words, this sum of translations and dilations of a sigmoidal function is capable of uniformly approximating any continuous functional on a compact subset of $\mathbb{R}^n$. Sandberg mentions in [20] that given that the statement is true for $n = 1$, the (i) $\Rightarrow$ (ii) section of his proof quickly extends the result to $n > 1$. Indeed, let $X$ be simply $\mathbb{R}^n$, let the elements $y_j$ be linear functionals defined on $\mathbb{R}^n$, and let $u_j(\beta) = c_j \sigma(\alpha_j \beta + \beta_j)$ where $c_j, \alpha_j, \beta_j \in \mathbb{R}$. This gives us a sum of the type desired for $n > 1$.
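For $n = 1$ the sum $\sum_j a_j \sigma(y_j x + \theta_j)$ can be built by hand: place one steep sigmoid at the midpoint of each cell of a uniform grid on $[0, 1]$ and weight it by the increment of $f$ across that cell. The sketch below follows this simple staircase-style construction, which is an illustrative assumption and not Cybenko's (non-constructive) proof. The logistic sigmoid, the `sharpness` parameter, the constant offset $f(0)$, and the target $\sin(2\pi x)$ are choices made for the example, and the supremum is only estimated on a grid.

```python
import math

def sigmoid(t):
    """Logistic function, a continuous sigmoid with limits 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_sum(f, n_units, sharpness=4.0):
    """Build g(x) = f(0) + sum_j a_j * sigmoid(y * x + theta_j) on [0, 1]."""
    h = 1.0 / n_units
    y = sharpness * n_units  # common dilation; steeper as the grid refines
    a = [f(k * h) - f((k - 1) * h) for k in range(1, n_units + 1)]
    theta = [-y * (k - 0.5) * h for k in range(1, n_units + 1)]
    offset = f(0.0)

    def g(x):
        return offset + sum(aj * sigmoid(y * x + tj) for aj, tj in zip(a, theta))

    return g

def sup_error(f, g, samples=1001):
    """Estimate sup |f - g| over [0, 1] on an evenly spaced grid."""
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - g(x)) for x in grid)

f = lambda x: math.sin(2.0 * math.pi * x)
err_10 = sup_error(f, sigmoid_sum(f, 10))
err_50 = sup_error(f, sigmoid_sum(f, 50))
```

Using more units reduces the estimated error, in line with the density statement, although this construction makes no attempt to use the fewest units possible.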

In [5], Cheney demonstrates as a result of the general theory of ridge functions that the result is applicable when the elements of the vectors $y_j$ and the numbers $\theta_j$ are integers. In fact, the theorem is given as follows.

Theorem 16 Let $g$ be a continuous function on $\mathbb{R}$ such that the limits of $g(t)$ as $t \to \infty$ and $t \to -\infty$ exist and are different. Put $g_{ij}(t) = g(jt + i)$. Then $\{\, g_{ij} : i, j \in \mathbb{Z} \,\}$ is fundamental in $C(\mathbb{R})$.

The proof of this theorem relies on measure theory, making use of the Riesz Representation Theorem and the Dominated Convergence Theorem. It is beyond the scope of this thesis but can be found in [4].

It is seen that this theorem allows $g$ to be a continuous sigmoid, but does not require it. The only requirement when using the translations and dilations is that the limits at $\infty$ and at $-\infty$ are not the same. It was mentioned earlier that it is often desired that the output of the activation function in a neural network be in a certain range such as $[0, 1]$. Sigmoidal functions fit nicely into this framework.

Finally, we can show at once that these shifted and scaled sigmoidal functions are capable of approximating on any normed linear space by using either of the two main theorems after noting that they are capable of approximating on $\mathbb{R}$.


Squashing Function Networks 


The previous section has dealt with the use of translations and dilations 
of continuous sigmoidal functions. In this section, we will deal with certain type 
of sigmoid that is not necessarily continuous, a squashing function, and attempt 


to obtain a similar result. A squashing function is defined in [12] as follows: 


Definition: A function V : R +> [0,1] is a squashing function if it is nonde- 
creasing, lim W(A) = 1, and lim WV(A)- O. 
A— 00 مو جم‎ 
It is seen at once that this definition simply requires that $\Psi$ be a
nondecreasing sigmoidal function (not necessarily continuous). Some useful
squashing functions include the threshold function,
$\Psi(\lambda) = 1_{\{\lambda \ge 0\}}$, where $1_{\{\cdot\}}$ is the indicator
function; the ramp function,
$\Psi(\lambda) = \lambda 1_{\{0 \le \lambda \le 1\}} + 1_{\{\lambda > 1\}}$; and
the cosine squasher (see [10]),
$\Psi(\lambda) = (1 + \cos[\lambda + 3\pi/2])(1/2) 1_{\{-\pi/2 \le \lambda \le \pi/2\}} + 1_{\{\lambda > \pi/2\}}$.

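The three squashing functions named above can be written down directly. The following is a minimal sketch; the function names and the vectorised NumPy style are choices made for this illustration and do not come from [10] or [12].

```python
import numpy as np

def threshold(lam):
    """Threshold squasher: indicator of lam >= 0."""
    return np.where(np.asarray(lam, dtype=float) >= 0.0, 1.0, 0.0)

def ramp(lam):
    """Ramp squasher: lam on [0, 1], 0 below, 1 above."""
    return np.clip(np.asarray(lam, dtype=float), 0.0, 1.0)

def cosine_squasher(lam):
    """Cosine squasher: (1 + cos(lam + 3*pi/2)) / 2 on [-pi/2, pi/2], 0 below, 1 above."""
    lam = np.asarray(lam, dtype=float)
    middle = 0.5 * (1.0 + np.cos(lam + 1.5 * np.pi))
    return np.where(lam < -np.pi / 2, 0.0,
                    np.where(lam > np.pi / 2, 1.0, middle))

grid = np.linspace(-5.0, 5.0, 1001)
values = {f.__name__: f(grid) for f in (threshold, ramp, cosine_squasher)}
```

Each function is nondecreasing, takes values in [0, 1], and tends to 0 and 1 at the two ends of the line, which is all the definition asks for; only the threshold function is discontinuous.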
Hornik et al. first define what they call a sigma-pi network and prove
certain results pertaining to it. Following this, they extend the results to a
network resembling those that have been mentioned above. We proceed as they
did, considering only the $\mathbb{R}^1$ case.


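Definition: For any measurable function $G$ mapping $\mathbb{R}$ to $\mathbb{R}$,
let $\Sigma\Pi^1(G)$ be the class of functions

$$\Bigl\{ f : \mathbb{R} \to \mathbb{R} : f(x) = \sum_{j=1}^{q} \beta_j \prod_{k=1}^{l_j} G(A_{jk}(x)),\ \beta_j \in \mathbb{R},\ A_{jk} \in \mathcal{A},\ q = 1, 2, \ldots \Bigr\}$$

where $l_j \in \mathbb{N}$ and $\mathcal{A}$ is the set of all affine functions
from $\mathbb{R}$ to $\mathbb{R}$, that is, the set of all functions of the form
$A(x) = wx + b$ where $w, b \in \mathbb{R}$. Networks of this form are referred
to as sigma-pi networks.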


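Definition: For any measurable function $G$ mapping $\mathbb{R}$ to $\mathbb{R}$,
let $\Sigma^1(G)$ be the class of functions

$$\Bigl\{ f : \mathbb{R} \to \mathbb{R} : f(x) = \sum_{j=1}^{q} \beta_j G(A_j(x)),\ \beta_j \in \mathbb{R},\ A_j \in \mathcal{A},\ q = 1, 2, \ldots \Bigr\}.$$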


The form of this second network clearly resembles the continuous sigmoidal
network that was shown above if $G$ is taken to be a continuous sigmoidal
function. The shifting and scaling that was present above is simply performed
by the affine function here; only the notation is different. For now, we will
continue to let $G$ be any function.
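The two classes of functions just defined can be evaluated with a few lines of code. This is a minimal sketch: the function names, the representation of an affine function as a pair (w, b), and the example weights are assumptions made for illustration only.

```python
import numpy as np

def sigma_network(x, betas, affines, G):
    """f(x) = sum_j beta_j * G(w_j * x + b_j); affines is a list of (w_j, b_j)."""
    x = np.asarray(x, dtype=float)
    return sum(beta * G(w * x + b) for beta, (w, b) in zip(betas, affines))

def sigma_pi_network(x, betas, affine_groups, G):
    """f(x) = sum_j beta_j * prod_k G(w_jk * x + b_jk); one group of (w, b) pairs per j."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for beta, group in zip(betas, affine_groups):
        term = np.ones_like(x)
        for w, b in group:
            term = term * G(w * x + b)
        total = total + beta * term
    return total

# Hypothetical example: two hidden units with G = tanh.
betas = [0.5, -1.0]
affines = [(1.0, 0.0), (2.0, -1.0)]
xs = np.linspace(-2.0, 2.0, 5)
plain = sigma_network(xs, betas, affines, np.tanh)
as_sigma_pi = sigma_pi_network(xs, betas, [[p] for p in affines], np.tanh)
```

Taking every product to have a single factor recovers the second class from the first, so the two results above agree.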


We now give the main result that applies here. 


Theorem 17 For every squashing function $\Psi$, $\Sigma^1(\Psi)$ is uniformly
dense on compacta in $C(\mathbb{R})$.


Proof: We proceed by first proving several lemmas that will aid in the proof. 


Lemma 1: Let $G : \mathbb{R} \to \mathbb{R}$ be continuous and nonconstant. Then
$\Sigma\Pi^1(G)$ is uniformly dense on compacta in $C(\mathbb{R})$.


Proof: We can apply the Stone-Weierstrass Theorem here. Let $C \subset \mathbb{R}$
be any compact set. For any $G$, $\Sigma\Pi^1(G)$ is obviously an algebra on $C$.
If $x, y \in C$, $x \ne y$, then we can find an $A_1 \in \mathcal{A}$ such that
$G(A_1(x)) \ne G(A_1(y))$. To show this, pick $a, b \in \mathbb{R}$, $a \ne b$,
such that $G(a) \ne G(b)$. Then choose $A_1(\cdot)$ to satisfy $A_1(x) = a$ and
$A_1(y) = b$. Then $G(A_1(x)) \ne G(A_1(y))$. This ensures that $\Sigma\Pi^1(G)$
is separating. Now we must show that $\Sigma\Pi^1(G)$ vanishes at no point of
$C$. Pick $b \in \mathbb{R}$ such that $G(b) \ne 0$ and set $A_2(x) = 0 \cdot x + b$.
For all $x \in C$, $G(A_2(x)) = G(b) \ne 0$, so this is a nonvanishing constant
function. The Stone-Weierstrass theorem now guarantees that $\Sigma\Pi^1(G)$ is
capable of uniformly approximating any continuous function on $C$.


This lemma shows that the sigma-pi networks are capable of uniformly
approximating any continuous function on a compact set; the only requirements
are that $G$ be continuous and nonconstant. We have not yet required that $G$
be a squashing function.


Lemma 2: Let $F$ be a continuous squashing function and $\Psi$ be an arbitrary
squashing function. For every $\epsilon > 0$ there is an element $H_\epsilon$ of
$\Sigma^1(\Psi)$ such that

$$\sup_{\lambda \in \mathbb{R}} |F(\lambda) - H_\epsilon(\lambda)| < \epsilon.$$


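Proof: Choose $\epsilon > 0$ and assume without loss of generality that
$\epsilon < 1$. We must now find constants $\beta_j$ and affine functions $A_j$,
$j \in \{1, 2, \ldots, Q-1\}$, such that

$$\sup_{\lambda \in \mathbb{R}} \Bigl| F(\lambda) - \sum_{j=1}^{Q-1} \beta_j \Psi(A_j(\lambda)) \Bigr| < \epsilon.$$

Choose $Q$ such that $1/Q < \epsilon/2$. For $j \in \{1, 2, \ldots, Q-1\}$, set
$\beta_j = 1/Q$. Pick $M > 0$ such that $\Psi(-M) < \epsilon/(2Q)$ and
$\Psi(M) > 1 - \epsilon/(2Q)$. Such an $M$ can be found because $\Psi$ is a
squashing function. For $j \in \{1, 2, \ldots, Q-1\}$, set
$r_j = \sup\{\lambda : F(\lambda) = j/Q\}$. Set
$r_Q = \sup\{\lambda : F(\lambda) = 1 - 1/(2Q)\}$. Because $F$ is a continuous
squashing function, such $r_j$'s exist. Now, for any $r < s$, let
$A_{r,s} \in \mathcal{A}$ be the unique affine function satisfying
$A_{r,s}(r) = -M$ and $A_{r,s}(s) = M$, so that $\Psi(A_{r,s}(\lambda))$ is
within $\epsilon/(2Q)$ of 0 for $\lambda \le r$ and within $\epsilon/(2Q)$ of 1
for $\lambda \ge s$. The desired approximation is then
$H_\epsilon(\lambda) = \sum_{j=1}^{Q-1} \beta_j \Psi(A_{r_j, r_{j+1}}(\lambda))$.
We can easily check that on each of the intervals
$(-\infty, r_1], (r_1, r_2], \ldots, (r_{Q-1}, r_Q], (r_Q, \infty)$ we have
$|F(\lambda) - H_\epsilon(\lambda)| < \epsilon$.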
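The construction in this proof can be checked numerically. The sketch below is an illustration under stated assumptions: it takes F to be the logistic function and Psi to be the threshold function (both choices made here, not in the source), and it orients each affine map so that it sends r_j to -M and r_{j+1} to M, which is what makes each term switch from near 0 to near 1 as lambda increases.

```python
import numpy as np

def logistic(lam):
    """Continuous squashing function F (an illustrative choice)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(lam, dtype=float)))

def logit(p):
    """Inverse of the logistic function, used to locate the points r_j."""
    return float(np.log(p / (1.0 - p)))

def threshold(lam):
    """Discontinuous squashing function Psi: the indicator of lam >= 0."""
    return np.where(lam >= 0.0, 1.0, 0.0)

def build_H(eps, M=1.0):
    """Return (H_eps, Q) following the proof: Q-1 terms, each with weight 1/Q."""
    Q = 1
    while 1.0 / Q >= eps / 2.0:          # smallest Q with 1/Q < eps/2
        Q += 1
    # r_1, ..., r_{Q-1} solve F(r_j) = j/Q; r_Q solves F(r_Q) = 1 - 1/(2Q).
    r = [logit(j / Q) for j in range(1, Q)] + [logit(1.0 - 1.0 / (2 * Q))]

    def H(lam):
        lam = np.asarray(lam, dtype=float)
        total = np.zeros_like(lam)
        for lo, hi in zip(r[:-1], r[1:]):
            A = M * (2.0 * lam - lo - hi) / (hi - lo)   # A(lo) = -M, A(hi) = M
            total += threshold(A) / Q
        return total

    return H, Q

eps = 0.1
H, Q = build_H(eps)
grid = np.linspace(-20.0, 20.0, 40001)
sup_error = float(np.max(np.abs(logistic(grid) - H(grid))))
```

For the threshold function any M > 0 satisfies the two conditions on Psi, so M = 1 is used. The resulting H is a staircase whose steps sit between consecutive r_j, and its distance from F on the grid stays below eps.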


Lemma 3: For every squashing function $\Psi$, every $\epsilon > 0$, and every
$M > 0$ there is a function $\cos_{M,\epsilon} \in \Sigma^1(\Psi)$ such that

$$\sup_{\lambda \in [-M, M]} |\cos_{M,\epsilon}(\lambda) - \cos(\lambda)| < \epsilon.$$


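Proof: Let $F$ be the cosine squasher previously defined. By adding,
subtracting, and scaling a finite number of affinely shifted versions of $F$,
we can get the cosine function on any interval $[-M, M]$. Since $F$ is
continuous, we may apply Lemma 2 and the triangle inequality to easily obtain
the result. Indeed, let $\tilde{F}$ denote this finite combination of shifted
copies of $F$, so that $\tilde{F}(\lambda) = \cos(\lambda)$ on $[-M, M]$, and
let $G$ be the element of $\Sigma^1(\Psi)$ obtained by replacing each copy of
$F$ with its approximation from Lemma 2, chosen accurately enough that
$|G(\lambda) - \tilde{F}(\lambda)| < \epsilon$. We then have on the interval
$[-M, M]$,

$$|G(\lambda) - \cos(\lambda)| \le |G(\lambda) - \tilde{F}(\lambda)| + |\tilde{F}(\lambda) - \cos(\lambda)|$$

$$= |G(\lambda) - \tilde{F}(\lambda)| + 0$$

$$< \epsilon$$

where the last line followed from Lemma 2.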
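The claim that the cosine can be assembled exactly from shifted cosine squashers on [-M, M] can be made concrete. The particular combination below is one possible choice worked out for this illustration, not taken from the source: with F the cosine squasher and K the smallest integer with K*pi >= M, the sum over k = -K, ..., K of 2(-1)^k F(lambda + pi/2 - k*pi), minus the constant (-1)^K, equals cos(lambda) on [-(K+1)*pi, K*pi].

```python
import numpy as np

def cosine_squasher(lam):
    """Cosine squasher: (1 + cos(lam + 3*pi/2)) / 2 on [-pi/2, pi/2], 0 below, 1 above."""
    lam = np.asarray(lam, dtype=float)
    middle = 0.5 * (1.0 + np.cos(lam + 1.5 * np.pi))
    return np.where(lam < -np.pi / 2, 0.0,
                    np.where(lam > np.pi / 2, 1.0, middle))

def cos_from_squashers(lam, M):
    """Finite signed sum of shifted cosine squashers that equals cos on [-M, M]."""
    lam = np.asarray(lam, dtype=float)
    K = int(np.ceil(M / np.pi))
    sign_K = 1.0 if K % 2 == 0 else -1.0
    # Constant term; a constant c is itself c * F(0*lam + b) for any b >= pi/2.
    total = np.full_like(lam, -sign_K)
    for k in range(-K, K + 1):
        sign = 1.0 if k % 2 == 0 else -1.0
        total += 2.0 * sign * cosine_squasher(lam + np.pi / 2 - k * np.pi)
    return total

M = 10.0
grid = np.linspace(-M, M, 2001)
max_gap = float(np.max(np.abs(cos_from_squashers(grid, M) - np.cos(grid))))
```

Each shifted squasher supplies one rising or falling half-wave, and the alternating signs stitch them together; outside the interval the sum is constant, which is why the identity holds only on a bounded interval.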


Lemma 4: Let $g(\cdot) = \sum_{j=1}^{Q} \beta_j \cos(A_j(\cdot))$,
$A_j \in \mathcal{A}$. For arbitrary squashing function $\Psi$, arbitrary
compact $C \subset \mathbb{R}$, and arbitrary $\epsilon > 0$, there is an
$f \in \Sigma^1(\Psi)$ such that $\sup_{x \in C} |g(x) - f(x)| < \epsilon$.


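Proof: Pick $M > 0$ such that for $j \in \{1, 2, \ldots, Q\}$,
$A_j(C) \subset [-M, M]$. Because $Q$ is finite, $C$ is compact and the
$A_j(\cdot)$ are continuous, such an $M$ can be found. Let
$Q' = Q \cdot \sum_{j=1}^{Q} |\beta_j|$. From Lemma 3, for all $x \in C$ we have
$\bigl| \sum_{j=1}^{Q} \beta_j \cos_{M,\epsilon/Q'}(A_j(x)) - g(x) \bigr| < \epsilon$.
Because $\cos_{M,\epsilon/Q'} \in \Sigma^1(\Psi)$, we see that

$$f(\cdot) = \sum_{j=1}^{Q} \beta_j \cos_{M,\epsilon/Q'}(A_j(\cdot)) \in \Sigma^1(\Psi).$$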
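The estimate quoted from Lemma 3 in this proof can be written out term by term. The following worked step assumes that at least one $\beta_j$ is nonzero (otherwise $g = 0$ and there is nothing to prove) and uses the fact that $A_j(x) \in [-M, M]$ for every $x \in C$.

```latex
\begin{aligned}
\Bigl|\sum_{j=1}^{Q}\beta_j \cos_{M,\epsilon/Q'}(A_j(x)) - g(x)\Bigr|
  &\le \sum_{j=1}^{Q} |\beta_j| \,
       \bigl|\cos_{M,\epsilon/Q'}(A_j(x)) - \cos(A_j(x))\bigr| \\
  &<   \sum_{j=1}^{Q} |\beta_j| \cdot \frac{\epsilon}{Q'}
   =   \frac{\epsilon}{Q} \le \epsilon .
\end{aligned}
```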


Now we turn to proving the theorem. By Lemma 1, the trigonometric
polynomials

$$\Bigl\{ \sum_{j=1}^{Q} \beta_j \prod_{k=1}^{l_j} \cos(A_{jk}(\cdot)) : Q, l_j \in \mathbb{N},\ \beta_j \in \mathbb{R},\ A_{jk} \in \mathcal{A} \Bigr\}$$

are uniformly dense on compact sets in $C(\mathbb{R})$. By repeated application
of the trigonometric identity
$\cos(a)\cos(b) = \frac{1}{2}[\cos(a + b) + \cos(a - b)]$, we may write every
trigonometric polynomial in the form $\sum_{t=1}^{T} \alpha_t \cos(A_t(\cdot))$
where $\alpha_t \in \mathbb{R}$ and $A_t \in \mathcal{A}$. The desired result
now follows from Lemma 4.
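To see why a product of cosines of affine functions reduces to a sum of cosines of affine functions, consider two factors with $A_1(x) = w_1 x + b_1$ and $A_2(x) = w_2 x + b_2$ (symbols introduced here for illustration). The product-to-sum identity gives

```latex
\cos(w_1 x + b_1)\,\cos(w_2 x + b_2)
  = \tfrac{1}{2}\cos\bigl((w_1 + w_2)x + (b_1 + b_2)\bigr)
  + \tfrac{1}{2}\cos\bigl((w_1 - w_2)x + (b_1 - b_2)\bigr),
```

and both arguments on the right are again affine in $x$. Applying this step repeatedly removes every product, leaving a single sum of cosines of affine functions.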


This now gives us another class of acceptable functions for the $u_j$ in
Figure 6, and choosing a squashing function will ensure that the output of each
$u_j$ is always between 0 and 1.


Radial Basis Function Networks 


An important type of function that may be used in some classifying
networks is the radial basis function, and more specifically, the Gaussian basis
function. While we cannot claim that a basis function network may be used for
uniform approximation in all cases, there are some useful examples.
Information about the universal approximation capability of radial basis
function networks may be found in [17]. We define a radial basis function as a
function which depends only on the norm of its argument. In other words, if $f$
is a radial basis function and $\|x\| = \|y\|$, then $f(x) = f(y)$.

We now give an example of a case in which uniform approximation is possible
using a radial basis function network. In this particular instance the basis
functions are Gaussian, functions that have other useful properties for
approximation networks. Let $H$ be a Hilbert space with inner product
$\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$ defined in the usual way. We
are interested mainly in $H = \mathbb{R}^n$ with
$\|x\| = (\sum_i x_i^2)^{1/2}$. Let $C \subset H$ be compact and let
$V \subset H$ be nonempty, convex, and satisfy the condition that for
$x_1, x_2 \in C$ with $x_1 \ne x_2$ there exists $u \in V$ such that
$\|x_1 - u\| \ne \|x_2 - u\|$. We can, for example, take $V$ to be $C$ as long
as $C$ is convex, or we can take $V$ to be any nonempty convex subset of $H$
containing an interior point. Let $P$ be a nonempty subset of $(0, \infty)$ or
$(-\infty, 0)$ that is closed under addition. Finally, let

$$L = \Bigl\{ g : C \to \mathbb{R} : g(x) = \sum_{j=1}^{m} a_j \exp(-\alpha_j \|x - v_j\|^2),\ m < \infty,\ a_j \in \mathbb{R},\ \alpha_j \in P,\ v_j \in V \Bigr\}.$$

It is immediately apparent that the structure of $L$ is of the form needed for
the elements $u_j$ in Figure 6. With these assumptions we get the following
theorem.


Theorem 18 Let $f : C \to \mathbb{R}$ be continuous and let $\epsilon > 0$. Then
there exists a $g \in L$ such that

$$|f(x) - g(x)| < \epsilon, \quad x \in C.$$


Proof: Using the property above and the convexity of $V$, we see that given
$a_1, a_2 \in \mathbb{R}$, $\alpha_1, \alpha_2 \in P$, and $v_1, v_2 \in V$,

$$a_1 \exp(-\alpha_1 \|x - v_1\|^2) \, a_2 \exp(-\alpha_2 \|x - v_2\|^2) = b \exp(-(\alpha_1 + \alpha_2) \|x - w\|^2)$$

for some $b \in \mathbb{R}$ and $w \in V$. Also we can see that
$\alpha_1 + \alpha_2 \in P$. So $L$ is an algebra. Choose $x_1$ and $x_2$ in $C$
and assume that $x_1 \ne x_2$. Then $\|x_1 - v\| \ne \|x_2 - v\|$ for some
$v \in V$ by our first assumption. Therefore
$\exp(-\alpha \|x_1 - v\|^2) \ne \exp(-\alpha \|x_2 - v\|^2)$, so $L$ separates
the points of $C$. Since the exponential is never zero, $L$ also vanishes at no
point of $C$. Therefore, by the Stone-Weierstrass theorem, the proof is
complete.
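The product rule used to show that L is an algebra can be made explicit and checked numerically. Completing the square gives w = (alpha_1 v_1 + alpha_2 v_2) / (alpha_1 + alpha_2), a convex combination of v_1 and v_2 and hence a point of V, and b = a_1 a_2 exp(-(alpha_1 alpha_2 / (alpha_1 + alpha_2)) ||v_1 - v_2||^2). These closed forms are derived here rather than quoted from the source; the sketch below assumes H = R^3 and arbitrary example parameters.

```python
import numpy as np

def gaussian(x, a, alpha, v):
    """a * exp(-alpha * ||x - v||^2), evaluated for each row of x."""
    return a * np.exp(-alpha * np.sum((x - v) ** 2, axis=-1))

def product_parameters(a1, alpha1, v1, a2, alpha2, v2):
    """Parameters (b, alpha, w) of the single Gaussian equal to the product."""
    alpha = alpha1 + alpha2
    w = (alpha1 * v1 + alpha2 * v2) / alpha           # convex combination of v1, v2
    b = a1 * a2 * np.exp(-(alpha1 * alpha2 / alpha) * np.sum((v1 - v2) ** 2))
    return b, alpha, w

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))                          # sample points in R^3
v1, v2 = rng.normal(size=3), rng.normal(size=3)       # hypothetical centers
a1, alpha1, a2, alpha2 = 1.5, 0.7, -2.0, 1.3          # hypothetical weights and widths

lhs = gaussian(x, a1, alpha1, v1) * gaussian(x, a2, alpha2, v2)
b, alpha, w = product_parameters(a1, alpha1, v1, a2, alpha2, v2)
rhs = gaussian(x, b, alpha, w)
```

The convexity of V is what keeps the new center w inside V, and closure of P under addition is what keeps the new width alpha_1 + alpha_2 inside P.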


Thus, in this somewhat less general compact space, the Gaussian basis
functions are capable of uniformly approximating any continuous function from
$C$ into $\mathbb{R}$. They therefore may be used as the elements $u_j$ in our
original network.

7. Applications


Classifier Example 


At this point we are ready to give an example of an actual classification
network using the framework that we have provided. This example will also
show how the mathematical formulations that we have developed relate to
the signal classification problems that were initially discussed.

Let $X$ be the space of continuous real-valued functions defined on $[0,1]^n$
with $\|\cdot\|$ the usual sup norm. Let $k$ and $r$ be positive constants and
let Lip($k$) denote the subset of $X$ consisting of the elements of $X$ that
satisfy a Lipschitz condition: $|x(a) - x(b)| \le k|a - b|$ for all $a$ and $b$.
This is a typical way to deal with a good class of nonlinear functions. Let
$x_1, x_2, \ldots, x_m$ be distinct elements of Lip($k$) and let
$C_j = \{x \in \mathrm{Lip}(k) : \|x - x_j\| \le r\}$ for each
$j = 1, 2, \ldots, m$.

Now assume that $r < (1/2) \min_{i \ne j} \|x_i - x_j\|$. It is clear that the
$C_j$ are pairwise disjoint if this condition is satisfied. Since each $C_j$ is
a closed bounded subset of $X$ that is equicontinuous on $[0,1]^n$, the
Arzela-Ascoli theorem (see [15]) shows that the $C_j$ are compact. As we have
shown earlier, since the $C_j$ are compact and pairwise disjoint, the union
$\bigcup_j C_j$ is also compact.
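The disjointness claim follows from the triangle inequality. Suppose, for contradiction, that some $x$ belonged to both $C_i$ and $C_j$ with $i \ne j$. Then

```latex
\|x_i - x_j\| \le \|x_i - x\| + \|x - x_j\| \le 2r
  < \min_{k \ne l} \|x_k - x_l\| \le \|x_i - x_j\| ,
```

which is impossible, so no such $x$ exists.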


We now state a theorem from [20] without the proof given there.


Theorem 19 Let $X$ denote the normed linear space of $\mathbb{R}$-valued
continuous functions on $Z := [0,1]^n$, with the usual max norm. Let
$g \in X^*$, and let $\epsilon > 0$. Then there are points
$a_1, \ldots, a_p \in Z$, points $c_1, \ldots, c_p \in \mathbb{R}$, and a
$q \in X$ such that

$$\sup_{x \in C} \Bigl| g(x) - \sum_{j=1}^{p} c_j x(a_j) \Bigr| < \epsilon$$

and

$$\sup_{x \in C} \Bigl| g(x) - \int_Z q(a) x(a) \, da \Bigr| < \epsilon.$$


This theorem shows that a classifier can be found in this case using a
simple sampling and summing operation or an integration. It applies directly
to our example at hand since we are working on $[0,1]^n$. We now know that it
is possible to classify the signals in our example using the structure in Figure
6 where the functional $y_j$ performs the sampling and summing or integration
operation.
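The two operations in Theorem 19 are closely related: when the functional is an integral against a weight q, sampling at p midpoints a_j with weights c_j = q(a_j)/p is a Riemann sum for that integral. The sketch below illustrates this for n = 1. It is not the construction from [20]; the weight function q, the signal x, and the number of samples p are hypothetical choices.

```python
import numpy as np

def sampled_functional(x, points, weights):
    """Sampling-and-summing form: sum_j c_j * x(a_j)."""
    return float(np.dot(weights, x(points)))

p = 100
a = (np.arange(p) + 0.5) / p           # sample points a_j: midpoints of [0, 1]
q = lambda s: np.ones_like(s)          # hypothetical weight function q(a) = 1
c = q(a) / p                           # weights c_j = q(a_j) / p
x = lambda s: s ** 2                   # hypothetical Lipschitz signal on [0, 1]

approx = sampled_functional(x, a, c)   # approximates the integral of q(a) x(a) over [0, 1]
exact = 1.0 / 3.0                      # integral of a^2 over [0, 1]
```

With these choices the sampled value differs from the integral only by the midpoint-rule error, which shrinks as p grows.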

This problem is directly applicable to the examples discussed earlier. If n =
1, 2, or 3, we are classifying continuous signals in one, two, or three variables.
This is the kind of sensor input that we might have in the automatic target
identification and pattern recognition examples that were mentioned earlier.


Conclusions 


We have described a specific neural network structure that is capable of 
solving certain classification problems. This structure has the form of a single 
hidden layer feedforward neural network and therefore possesses the advantages 
of neural networks that were mentioned above. It has a simple framework that 
is easily built in hardware or simulated in software. 

It is important to note that there are limitations to the methods
presented here. All of the proofs are existence proofs. They guarantee that a
solution is possible and in some cases give a general idea of how it might be
accomplished. For example, we have seen how certain classes of functions such as
sigmoids and polynomials are capable of being used as the activation functions
(the $u_j$) in a classifying neural network. What has not been determined is the
number of nodes needed. We can only say that classification is possible with a
finite number of nodes. Further, we have not given a definite method of finding
the weights $c_j$ in Figure 6. This is typically what is referred to as training
the neural network.

In spite of these shortcomings, we have succeeded in providing a general
framework for studying the important problem of signal classification.
We have accomplished this by using well-known theorems dealing with
approximation. This area of research is fairly new and has proven extremely
useful so far, and interest in it will continue to grow in the future.